[PATCH 0/4] (x)diff cleanup: remove duplicate code

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/4] (x)diff cleanup: remove duplicate code
@ 2017-10-24 18:59 Stefan Beller
  2017-10-24 18:59 ` [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash Stefan Beller
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 18:59 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller

Junio wrote:

>  * I was hoping that the next_byte() and string_hash() thing, once
>    they are cleaned up, will eventually be shared with the xdiff/
>    code at the lower layer, which needs to do pretty much the same
>    in order to implement various whitespace ignoring options.  I am
>    not sure how well the approach taken by the WIP patch meshes with
>    the needs of the lower layer.

This series does exactly this; although I chose to reuse the code in
xdiff/xutils.c instead of the new fancy next_byte/string_hash, as that
code has seen more exercise already (hence I assume it has fewer bugs
if any as well as its performance implications are well understood).

However to do so, we need to pollute xdiff/xutils.c and include
hashmap.h there (which also requires cache.h as that header has
an inline function using BUG()), which I find a bit unfortunate but
worth the trade off of using better tested code.

Thanks,
Stefan

Stefan Beller (4):
  hashmap: introduce memhash_feed to access the internals of FNV-1 hash
  xdiff-interface: export comparing and hashing strings
  xdiff: use stronger hash function internally
  diff.c: get rid of duplicate implementation

 diff.c            | 82 +++----------------------------------------------------
 hashmap.c         |  7 ++++-
 hashmap.h         |  3 ++
 xdiff-interface.c | 11 ++++++++
 xdiff-interface.h |  5 ++++
 xdiff/xutils.c    | 19 ++++++-------
 6 files changed, 37 insertions(+), 90 deletions(-)

-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash
  2017-10-24 18:59 [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
@ 2017-10-24 18:59 ` Stefan Beller
  2017-10-24 20:23   ` René Scharfe
  2017-10-24 18:59 ` [PATCH 2/4] xdiff-interface: export comparing and hashing strings Stefan Beller
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 18:59 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller

This will be useful shortly.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 hashmap.c | 7 ++++++-
 hashmap.h | 3 +++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/hashmap.c b/hashmap.c
index d42f01ff5a..d103eb1fd2 100644
--- a/hashmap.c
+++ b/hashmap.c
@@ -26,13 +26,18 @@ unsigned int strihash(const char *str)
 	return hash;
 }
 
+unsigned int memhash_feed(unsigned int hash_seed, const unsigned char next)
+{
+	return (hash_seed * FNV32_PRIME) ^ next;
+}
+
 unsigned int memhash(const void *buf, size_t len)
 {
 	unsigned int hash = FNV32_BASE;
 	unsigned char *ucbuf = (unsigned char *) buf;
 	while (len--) {
 		unsigned int c = *ucbuf++;
-		hash = (hash * FNV32_PRIME) ^ c;
+		hash = memhash_feed(hash, c);
 	}
 	return hash;
 }
diff --git a/hashmap.h b/hashmap.h
index 7cb29a6aed..c2464385ed 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -105,10 +105,13 @@
  * `strihash` and `memihash` are case insensitive versions.
  * `memihash_cont` is a variant of `memihash` that allows a computation to be
  * continued with another chunk of data.
+ * `memhash_feed` takes just one character and returns the hash based off
+ * a previous hash.
  */
 extern unsigned int strhash(const char *buf);
 extern unsigned int strihash(const char *buf);
 extern unsigned int memhash(const void *buf, size_t len);
+extern unsigned int memhash_feed(unsigned int hash_seed, const unsigned char next);
 extern unsigned int memihash(const void *buf, size_t len);
 extern unsigned int memihash_cont(unsigned int hash_seed, const void *buf, size_t len);
 
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/4] xdiff-interface: export comparing and hashing strings
  2017-10-24 18:59 [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
  2017-10-24 18:59 ` [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash Stefan Beller
@ 2017-10-24 18:59 ` Stefan Beller
  2017-10-24 20:23   ` René Scharfe
  2017-10-24 18:59 ` [PATCH 3/4] xdiff: use stronger hash function internally Stefan Beller
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 18:59 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller

This will turn out to be useful in a later patch

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 xdiff-interface.c | 11 +++++++++++
 xdiff-interface.h |  5 +++++
 2 files changed, 16 insertions(+)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 018e033089..fd002ebbc2 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -5,6 +5,7 @@
 #include "xdiff/xdiffi.h"
 #include "xdiff/xemit.h"
 #include "xdiff/xmacros.h"
+#include "xdiff/xutils.h"
 
 struct xdiff_emit_state {
 	xdiff_emit_consume_fn consume;
@@ -296,6 +297,16 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 	}
 }
 
+unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
+{
+	return xdl_hash_record(&s, s + len, flags);
+}
+
+int xdiff_compare_lines(const char *l1, long s1, const char *l2, long s2, long flags)
+{
+	return xdl_recmatch(l1, s1, l2, s2, flags);
+}
+
 int git_xmerge_style = -1;
 
 int git_xmerge_config(const char *var, const char *value, void *cb)
diff --git a/xdiff-interface.h b/xdiff-interface.h
index 6f6ba9095d..d3cb9285c5 100644
--- a/xdiff-interface.h
+++ b/xdiff-interface.h
@@ -29,4 +29,9 @@ extern void xdiff_clear_find_func(xdemitconf_t *xecfg);
 extern int git_xmerge_config(const char *var, const char *value, void *cb);
 extern int git_xmerge_style;
 
+extern int xdiff_compare_lines(const char *l1, long s1,
+			       const char *l2, long s2, long flags);
+
+extern unsigned long xdiff_hash_string(const char *s, size_t len, long flags);
+
 #endif
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/4] xdiff: use stronger hash function internally
  2017-10-24 18:59 [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
  2017-10-24 18:59 ` [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash Stefan Beller
  2017-10-24 18:59 ` [PATCH 2/4] xdiff-interface: export comparing and hashing strings Stefan Beller
@ 2017-10-24 18:59 ` Stefan Beller
  2017-10-24 20:23   ` René Scharfe
  2017-10-24 23:04   ` Jonathan Nieder
  2017-10-24 18:59 ` [PATCH 4/4] diff.c: get rid of duplicate implementation Stefan Beller
  2017-10-24 19:02 ` [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
  4 siblings, 2 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 18:59 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller

Instead of using the hash seeded with 5381, and updated via
`(hash << 5) ^ new_byte`, use the FNV-1 primitives as offered by
hashmap.h, which is seeded with 0x811c9dc5 and computed as
`(hash * 0x01000193) ^ new_byte`.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 xdiff/xutils.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 04d7b32e4e..a58a28c687 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -24,7 +24,8 @@
 #include <assert.h>
 #include "xinclude.h"
 
-
+#include "cache.h"
+#include "hashmap.h"
 
 
 long xdl_bogosqrt(long n) {
@@ -228,7 +229,7 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 
 static unsigned long xdl_hash_record_with_whitespace(char const **data,
 		char const *top, long flags) {
-	unsigned long ha = 5381;
+	unsigned long ha = memhash(NULL, 0);
 	char const *ptr = *data;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
@@ -243,21 +244,18 @@ static unsigned long xdl_hash_record_with_whitespace(char const **data,
 				; /* already handled */
 			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
 				 && !at_eol) {
-				ha += (ha << 5);
-				ha ^= (unsigned long) ' ';
+				ha = memhash_feed(ha, (unsigned char) ' ');
 			}
 			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
 				 && !at_eol) {
 				while (ptr2 != ptr + 1) {
-					ha += (ha << 5);
-					ha ^= (unsigned long) *ptr2;
+					ha = memhash_feed(ha, (unsigned char) *ptr2);
 					ptr2++;
 				}
 			}
 			continue;
 		}
-		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha = memhash_feed(ha, (unsigned char) *ptr);
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
@@ -265,15 +263,14 @@ static unsigned long xdl_hash_record_with_whitespace(char const **data,
 }
 
 unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
-	unsigned long ha = 5381;
+	unsigned long ha = memhash(NULL, 0);
 	char const *ptr = *data;
 
 	if (flags & XDF_WHITESPACE_FLAGS)
 		return xdl_hash_record_with_whitespace(data, top, flags);
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
-		ha += (ha << 5);
-		ha ^= (unsigned long) *ptr;
+		ha = memhash_feed(ha, (unsigned char) *ptr);
 	}
 	*data = ptr < top ? ptr + 1: ptr;
 
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/4] diff.c: get rid of duplicate implementation
  2017-10-24 18:59 [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
                   ` (2 preceding siblings ...)
  2017-10-24 18:59 ` [PATCH 3/4] xdiff: use stronger hash function internally Stefan Beller
@ 2017-10-24 18:59 ` Stefan Beller
  2017-10-24 19:02 ` [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
  4 siblings, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 18:59 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller

The implementations in diff.c to detect moved lines
needs to compare strings and hash strings, which is
implemented in that file, as well as in the xdiff
library.

Remove the rather recent implementation in diff.c
and rely on the well exercised code in the xdiff lib.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 diff.c | 82 ++++--------------------------------------------------------------
 1 file changed, 4 insertions(+), 78 deletions(-)

diff --git a/diff.c b/diff.c
index c4a669ffa8..e6814b9e9c 100644
--- a/diff.c
+++ b/diff.c
@@ -707,88 +707,14 @@ struct moved_entry {
 	struct moved_entry *next_line;
 };
 
-static int next_byte(const char **cp, const char **endp,
-		     const struct diff_options *diffopt)
-{
-	int retval;
-
-	if (*cp >= *endp)
-		return -1;
-
-	if (isspace(**cp)) {
-		if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) {
-			while (*cp < *endp && isspace(**cp))
-				(*cp)++;
-			/*
-			 * After skipping a couple of whitespaces,
-			 * we still have to account for one space.
-			 */
-			return (int)' ';
-		}
-
-		if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) {
-			while (*cp < *endp && isspace(**cp))
-				(*cp)++;
-			/*
-			 * return the first non-ws character via the usual
-			 * below, unless we ate all of the bytes
-			 */
-			if (*cp >= *endp)
-				return -1;
-		}
-	}
-
-	retval = (unsigned char)(**cp);
-	(*cp)++;
-	return retval;
-}
-
 static int moved_entry_cmp(const struct diff_options *diffopt,
 			   const struct moved_entry *a,
 			   const struct moved_entry *b,
 			   const void *keydata)
 {
-	const char *ap = a->es->line, *ae = a->es->line + a->es->len;
-	const char *bp = b->es->line, *be = b->es->line + b->es->len;
-
-	if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS))
-		return a->es->len != b->es->len  || memcmp(ap, bp, a->es->len);
-
-	if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_AT_EOL)) {
-		while (ae > ap && isspace(ae[-1]))
-			ae--;
-		while (be > bp && isspace(be[-1]))
-			be--;
-	}
-
-	while (1) {
-		int ca, cb;
-		ca = next_byte(&ap, &ae, diffopt);
-		cb = next_byte(&bp, &be, diffopt);
-		if (ca != cb)
-			return 1;
-		if (ca < 0)
-			return 0;
-	}
-}
-
-static unsigned get_string_hash(struct emitted_diff_symbol *es, struct diff_options *o)
-{
-	if (o->xdl_opts & XDF_WHITESPACE_FLAGS) {
-		static struct strbuf sb = STRBUF_INIT;
-		const char *ap = es->line, *ae = es->line + es->len;
-		int c;
-
-		strbuf_reset(&sb);
-		while (ae > ap && isspace(ae[-1]))
-			ae--;
-		while ((c = next_byte(&ap, &ae, o)) >= 0)
-			strbuf_addch(&sb, c);
-
-		return memhash(sb.buf, sb.len);
-	} else {
-		return memhash(es->line, es->len);
-	}
+	return !xdiff_compare_lines(a->es->line, a->es->len,
+				    b->es->line, b->es->len,
+				    diffopt->xdl_opts);
 }
 
 static struct moved_entry *prepare_entry(struct diff_options *o,
@@ -797,7 +723,7 @@ static struct moved_entry *prepare_entry(struct diff_options *o,
 	struct moved_entry *ret = xmalloc(sizeof(*ret));
 	struct emitted_diff_symbol *l = &o->emitted_symbols->buf[line_no];
 
-	ret->ent.hash = get_string_hash(l, o);
+	ret->ent.hash = xdiff_hash_string(l->line, l->len, o->xdl_opts);
 	ret->es = l;
 	ret->next_line = NULL;
 
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] (x)diff cleanup: remove duplicate code
  2017-10-24 18:59 [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
                   ` (3 preceding siblings ...)
  2017-10-24 18:59 ` [PATCH 4/4] diff.c: get rid of duplicate implementation Stefan Beller
@ 2017-10-24 19:02 ` Stefan Beller
  2017-10-24 23:42   ` [PATCHv2 0/2] " Stefan Beller
  2017-10-24 23:45   ` [PATCH 1/5] fnv: migrate code from hashmap to fnv Stefan Beller
  4 siblings, 2 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 19:02 UTC (permalink / raw)
  To: git@vger.kernel.org, Jeff King; +Cc: Stefan Beller

On Tue, Oct 24, 2017 at 11:59 AM, Stefan Beller <sbeller@google.com> wrote:
> Junio wrote:
>
>>  * I was hoping that the next_byte() and string_hash() thing, once
>>    they are cleaned up, will eventually be shared with the xdiff/
>>    code at the lower layer, which needs to do pretty much the same
>>    in order to implement various whitespace ignoring options.  I am
>>    not sure how well the approach taken by the WIP patch meshes with
>>    the needs of the lower layer.
>
> This series does exactly this; although I chose to reuse the code in
> xdiff/xutils.c instead of the new fancy next_byte/string_hash, as that
> code has seen more exercise already (hence I assume it has fewer bugs
> if any as well as its performance implications are well understood).
>
> However to do so, we need to pollute xdiff/xutils.c and include
> hashmap.h there (which also requires cache.h as that header has
> an inline function using BUG()), which I find a bit unfortunate but
> worth the trade off of using better tested code.
>

This, of course, applies on top of jk/diff-color-moved-fix.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash
  2017-10-24 18:59 ` [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash Stefan Beller
@ 2017-10-24 20:23   ` René Scharfe
  2017-10-24 20:48     ` Stefan Beller
  0 siblings, 1 reply; 31+ messages in thread
From: René Scharfe @ 2017-10-24 20:23 UTC (permalink / raw)
  To: Stefan Beller, git

Am 24.10.2017 um 20:59 schrieb Stefan Beller:
> This will be useful shortly.
> 
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>   hashmap.c | 7 ++++++-
>   hashmap.h | 3 +++
>   2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/hashmap.c b/hashmap.c
> index d42f01ff5a..d103eb1fd2 100644
> --- a/hashmap.c
> +++ b/hashmap.c
> @@ -26,13 +26,18 @@ unsigned int strihash(const char *str)
>   	return hash;
>   }
>   
> +unsigned int memhash_feed(unsigned int hash_seed, const unsigned char next)

Why is the second parameter const and the first one isn't?  (We tend
not to bother with const for value types.)

> +{
> +	return (hash_seed * FNV32_PRIME) ^ next;
> +}
> +
>   unsigned int memhash(const void *buf, size_t len)
>   {
>   	unsigned int hash = FNV32_BASE;
>   	unsigned char *ucbuf = (unsigned char *) buf;
>   	while (len--) {
>   		unsigned int c = *ucbuf++;
> -		hash = (hash * FNV32_PRIME) ^ c;
> +		hash = memhash_feed(hash, c);

I guess compilers inline a copy of the function here with -O2.  My
knee-jerk reaction, however, is horror in the face of adding a function
call to the inner loop of a hash function.  Do you have performance
test results, ideally also with -O0?  And why not make memhash_feed()
an inline function or macro to sidestep that issue?

>   	}
>   	return hash;
>   }
> diff --git a/hashmap.h b/hashmap.h
> index 7cb29a6aed..c2464385ed 100644
> --- a/hashmap.h
> +++ b/hashmap.h
> @@ -105,10 +105,13 @@
>    * `strihash` and `memihash` are case insensitive versions.
>    * `memihash_cont` is a variant of `memihash` that allows a computation to be
>    * continued with another chunk of data.
> + * `memhash_feed` takes just one character and returns the hash based off
> + * a previous hash.
>    */
>   extern unsigned int strhash(const char *buf);
>   extern unsigned int strihash(const char *buf);
>   extern unsigned int memhash(const void *buf, size_t len);
> +extern unsigned int memhash_feed(unsigned int hash_seed, const unsigned char next);
>   extern unsigned int memihash(const void *buf, size_t len);
>   extern unsigned int memihash_cont(unsigned int hash_seed, const void *buf, size_t len);
>   
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/4] xdiff-interface: export comparing and hashing strings
  2017-10-24 18:59 ` [PATCH 2/4] xdiff-interface: export comparing and hashing strings Stefan Beller
@ 2017-10-24 20:23   ` René Scharfe
  2017-10-24 20:42     ` Stefan Beller
  0 siblings, 1 reply; 31+ messages in thread
From: René Scharfe @ 2017-10-24 20:23 UTC (permalink / raw)
  To: Stefan Beller, git

Am 24.10.2017 um 20:59 schrieb Stefan Beller:
> This will turn out to be useful in a later patch
> 
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>   xdiff-interface.c | 11 +++++++++++
>   xdiff-interface.h |  5 +++++
>   2 files changed, 16 insertions(+)
> 
> diff --git a/xdiff-interface.c b/xdiff-interface.c
> index 018e033089..fd002ebbc2 100644
> --- a/xdiff-interface.c
> +++ b/xdiff-interface.c
> @@ -5,6 +5,7 @@
>   #include "xdiff/xdiffi.h"
>   #include "xdiff/xemit.h"
>   #include "xdiff/xmacros.h"
> +#include "xdiff/xutils.h"
>   
>   struct xdiff_emit_state {
>   	xdiff_emit_consume_fn consume;
> @@ -296,6 +297,16 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
>   	}
>   }
>   
> +unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
> +{
> +	return xdl_hash_record(&s, s + len, flags);
> +}
> +
> +int xdiff_compare_lines(const char *l1, long s1, const char *l2, long s2, long flags)
> +{
> +	return xdl_recmatch(l1, s1, l2, s2, flags);
> +}

xdl_recmatch() is already exported; why not use it without this
wrapper?

> +
>   int git_xmerge_style = -1;
>   
>   int git_xmerge_config(const char *var, const char *value, void *cb)
> diff --git a/xdiff-interface.h b/xdiff-interface.h
> index 6f6ba9095d..d3cb9285c5 100644
> --- a/xdiff-interface.h
> +++ b/xdiff-interface.h
> @@ -29,4 +29,9 @@ extern void xdiff_clear_find_func(xdemitconf_t *xecfg);
>   extern int git_xmerge_config(const char *var, const char *value, void *cb);
>   extern int git_xmerge_style;
>   
> +extern int xdiff_compare_lines(const char *l1, long s1,
> +			       const char *l2, long s2, long flags);
> +
> +extern unsigned long xdiff_hash_string(const char *s, size_t len, long flags);

Documenting the meaning of their parameters would be nice.  s and len
are easy enough to guess, but which flags can be used?  At least a
pointer to their definition in xdiff/xdiff.h would be helpful.  And
renaming l1, s1, l2, s2 to a, alen, b, blen or line1, len1, line2, len2
or similar would leave me less confused, but perhaps that's just me.

> +
>   #endif
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] xdiff: use stronger hash function internally
  2017-10-24 18:59 ` [PATCH 3/4] xdiff: use stronger hash function internally Stefan Beller
@ 2017-10-24 20:23   ` René Scharfe
  2017-10-24 20:46     ` Stefan Beller
  2017-10-24 23:04   ` Jonathan Nieder
  1 sibling, 1 reply; 31+ messages in thread
From: René Scharfe @ 2017-10-24 20:23 UTC (permalink / raw)
  To: Stefan Beller, git

Am 24.10.2017 um 20:59 schrieb Stefan Beller:
> Instead of using the hash seeded with 5381, and updated via
> `(hash << 5) ^ new_byte`, use the FNV-1 primitives as offered by
> hashmap.h, which is seeded with 0x811c9dc5 and computed as
> `(hash * 0x01000193) ^ new_byte`.

The hash function you're replacing is called DJB2; I think that's worth
mentioning.

Performance test results would be nice.  No idea how to find edge cases,
though, or better: demonstrate a lack thereof.

> 
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>   xdiff/xutils.c | 19 ++++++++-----------
>   1 file changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/xdiff/xutils.c b/xdiff/xutils.c
> index 04d7b32e4e..a58a28c687 100644
> --- a/xdiff/xutils.c
> +++ b/xdiff/xutils.c
> @@ -24,7 +24,8 @@
>   #include <assert.h>
>   #include "xinclude.h"
>   
> -
> +#include "cache.h"
> +#include "hashmap.h"

Ouch.  Defining FNV32_BASE and FNV32_PRIME here would be much easier
overall.  And if that's too much duplication then those definitions
could be extracted into a new header file (fnv32.h?) included by both
hashmap.h and xutils.c.

>   
>   
>   long xdl_bogosqrt(long n) {
> @@ -228,7 +229,7 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
>   
>   static unsigned long xdl_hash_record_with_whitespace(char const **data,
>   		char const *top, long flags) {
> -	unsigned long ha = 5381;
> +	unsigned long ha = memhash(NULL, 0);
>   	char const *ptr = *data;
>   
>   	for (; ptr < top && *ptr != '\n'; ptr++) {
> @@ -243,21 +244,18 @@ static unsigned long xdl_hash_record_with_whitespace(char const **data,
>   				; /* already handled */
>   			else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
>   				 && !at_eol) {
> -				ha += (ha << 5);
> -				ha ^= (unsigned long) ' ';
> +				ha = memhash_feed(ha, (unsigned char) ' ');

All the memhash_feed() callers in this file cast to unsigned char.  A
macro or a function (possibly inline) defined at the top could do
that for them.

>   			}
>   			else if (flags & XDF_IGNORE_WHITESPACE_AT_EOL
>   				 && !at_eol) {
>   				while (ptr2 != ptr + 1) {
> -					ha += (ha << 5);
> -					ha ^= (unsigned long) *ptr2;
> +					ha = memhash_feed(ha, (unsigned char) *ptr2);
>   					ptr2++;
>   				}
>   			}
>   			continue;
>   		}
> -		ha += (ha << 5);
> -		ha ^= (unsigned long) *ptr;
> +		ha = memhash_feed(ha, (unsigned char) *ptr);
>   	}
>   	*data = ptr < top ? ptr + 1: ptr;
>   
> @@ -265,15 +263,14 @@ static unsigned long xdl_hash_record_with_whitespace(char const **data,
>   }
>   
>   unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
> -	unsigned long ha = 5381;
> +	unsigned long ha = memhash(NULL, 0);
>   	char const *ptr = *data;
>   
>   	if (flags & XDF_WHITESPACE_FLAGS)
>   		return xdl_hash_record_with_whitespace(data, top, flags);
>   
>   	for (; ptr < top && *ptr != '\n'; ptr++) {
> -		ha += (ha << 5);
> -		ha ^= (unsigned long) *ptr;
> +		ha = memhash_feed(ha, (unsigned char) *ptr);
>   	}
>   	*data = ptr < top ? ptr + 1: ptr;
>   
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/4] xdiff-interface: export comparing and hashing strings
  2017-10-24 20:23   ` René Scharfe
@ 2017-10-24 20:42     ` Stefan Beller
  2017-10-26 17:03       ` René Scharfe
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 20:42 UTC (permalink / raw)
  To: René Scharfe; +Cc: git

On Tue, Oct 24, 2017 at 1:23 PM, René Scharfe <l.s.r@web.de> wrote:

> xdl_recmatch() is already exported; why not use it without this
> wrapper?

It is exported in xdiff/xutils.h, to be used by various xdiff/*.c files, but
not outside of xdiff/. This one makes it available to the outside, too.

>> +extern unsigned long xdiff_hash_string(const char *s, size_t len, long flags);
>
> Documenting the meaning of their parameters would be nice.

I'll do that.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] xdiff: use stronger hash function internally
  2017-10-24 20:23   ` René Scharfe
@ 2017-10-24 20:46     ` Stefan Beller
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 20:46 UTC (permalink / raw)
  To: René Scharfe; +Cc: git

On Tue, Oct 24, 2017 at 1:23 PM, René Scharfe <l.s.r@web.de> wrote:
> Am 24.10.2017 um 20:59 schrieb Stefan Beller:
>> Instead of using the hash seeded with 5381, and updated via
>> `(hash << 5) ^ new_byte`, use the FNV-1 primitives as offered by
>> hashmap.h, which is seeded with 0x811c9dc5 and computed as
>> `(hash * 0x01000193) ^ new_byte`.
>
> The hash function you're replacing is called DJB2; I think that's worth
> mentioning.

I was not aware of the name. I'll look it up; thanks!

>
> Performance test results would be nice.  No idea how to find edge cases,
> though, or better: demonstrate a lack thereof.

My reasoning, though not in the commit message, is that the operations
are essentially equal, just with different numeric values, hence no impact.

I can look at the assembly and measure, too.

>
>>
>> Signed-off-by: Stefan Beller <sbeller@google.com>
>> ---
>>   xdiff/xutils.c | 19 ++++++++-----------
>>   1 file changed, 8 insertions(+), 11 deletions(-)
>>
>> diff --git a/xdiff/xutils.c b/xdiff/xutils.c
>> index 04d7b32e4e..a58a28c687 100644
>> --- a/xdiff/xutils.c
>> +++ b/xdiff/xutils.c
>> @@ -24,7 +24,8 @@
>>   #include <assert.h>
>>   #include "xinclude.h"
>>
>> -
>> +#include "cache.h"
>> +#include "hashmap.h"
>
> Ouch.  Defining FNV32_BASE and FNV32_PRIME here would be much easier
> overall.  And if that's too much duplication then those definitions
> could be extracted into a new header file (fnv32.h?) included by both
> hashmap.h and xutils.c.

I guess fnv32.h would do; it would contain the defines as well as the
static inline
function to be used in the inner loop of patch 1.

>
>>
>>
>>   long xdl_bogosqrt(long n) {
>> @@ -228,7 +229,7 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
>>
>>   static unsigned long xdl_hash_record_with_whitespace(char const **data,
>>               char const *top, long flags) {
>> -     unsigned long ha = 5381;
>> +     unsigned long ha = memhash(NULL, 0);
>>       char const *ptr = *data;
>>
>>       for (; ptr < top && *ptr != '\n'; ptr++) {
>> @@ -243,21 +244,18 @@ static unsigned long xdl_hash_record_with_whitespace(char const **data,
>>                               ; /* already handled */
>>                       else if (flags & XDF_IGNORE_WHITESPACE_CHANGE
>>                                && !at_eol) {
>> -                             ha += (ha << 5);
>> -                             ha ^= (unsigned long) ' ';
>> +                             ha = memhash_feed(ha, (unsigned char) ' ');
>
> All the memhash_feed() callers in this file cast to unsigned char.  A
> macro or a function (possibly inline) defined at the top could do
> that for them.

That would go away when using fnv32.h

Thanks for the review!
Stefan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash
  2017-10-24 20:23   ` René Scharfe
@ 2017-10-24 20:48     ` Stefan Beller
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 20:48 UTC (permalink / raw)
  To: René Scharfe; +Cc: git

>> +unsigned int memhash_feed(unsigned int hash_seed, const unsigned char next)
>
> Why is the second parameter const and the first one isn't?  (We tend
> not to bother with const for value types.)

will do.

>> +             hash = memhash_feed(hash, c);
>
> I guess compilers inline a copy of the function here with -O2.  My
> knee-jerk reaction, however, is horror in the face of adding a function
> call to the inner loop of a hash function.  Do you have performance
> test results, ideally also with -O0?  And why not make memhash_feed()
> an inline function or macro to sidestep that issue?

My gut feeling was similar, but then I assumed compilers of today would
be smart.

As per the discussion on a later patch, we could migrate all memhash* functions
to fnv32.c except for the _feed, which will be static in its header fnv32.h

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] xdiff: use stronger hash function internally
  2017-10-24 18:59 ` [PATCH 3/4] xdiff: use stronger hash function internally Stefan Beller
  2017-10-24 20:23   ` René Scharfe
@ 2017-10-24 23:04   ` Jonathan Nieder
  1 sibling, 0 replies; 31+ messages in thread
From: Jonathan Nieder @ 2017-10-24 23:04 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

Stefan Beller wrote:

> --- a/xdiff/xutils.c
> +++ b/xdiff/xutils.c
> @@ -24,7 +24,8 @@
>  #include <assert.h>
>  #include "xinclude.h"
>  
> -
> +#include "cache.h"
> +#include "hashmap.h"

#includes like "git-compat-util.h" and "cache.h" can only be used as
the *first* #include.

Otherwise the feature test macros they set do not take effect, causing
lots of unpleasant platform-specific breakage.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCHv2 0/2] (x)diff cleanup: remove duplicate code
  2017-10-24 19:02 ` [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
@ 2017-10-24 23:42   ` Stefan Beller
  2017-10-24 23:42     ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
                       ` (2 more replies)
  2017-10-24 23:45   ` [PATCH 1/5] fnv: migrate code from hashmap to fnv Stefan Beller
  1 sibling, 3 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 23:42 UTC (permalink / raw)
  To: sbeller; +Cc: git, peff, jrnieder, l.s.r

v2:
* I realized that we don't have to change the hashing function of xdiff;
  we can rather change the hashing function for the move detection,
  which is less fundamental.
  (That way I can shrink the series down to 2 patches)
* commented and renamed function parameters in the exposed xdiff functions.
* applies on top of jk/diff-color-moved-fix.

Thanks,
Stefan

v1:
Junio wrote:

>  * I was hoping that the next_byte() and string_hash() thing, once
>    they are cleaned up, will eventually be shared with the xdiff/
>    code at the lower layer, which needs to do pretty much the same
>    in order to implement various whitespace ignoring options.  I am
>    not sure how well the approach taken by the WIP patch meshes with
>    the needs of the lower layer.

This series does exactly this; although I chose to reuse the code in
xdiff/xutils.c instead of the new fancy next_byte/string_hash, as that
code has seen more exercise already (hence I assume it has fewer bugs
if any as well as its performance implications are well understood).

However to do so, we need to pollute xdiff/xutils.c and include
hashmap.h there (which also requires cache.h as that header has
an inline function using BUG()), which I find a bit unfortunate but
worth the trade off of using better tested code.

Thanks,
Stefan

Stefan Beller (2):
  xdiff-interface: export comparing and hashing strings
  diff.c: get rid of duplicate implementation

 diff.c            | 82 +++----------------------------------------------------
 xdiff-interface.c | 11 ++++++++
 xdiff-interface.h | 16 +++++++++++
 3 files changed, 31 insertions(+), 78 deletions(-)

-- 
2.15.0.rc2.6.g953226eb5f

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] xdiff-interface: export comparing and hashing strings
  2017-10-24 23:42   ` [PATCHv2 0/2] " Stefan Beller
@ 2017-10-24 23:42     ` Stefan Beller
  2017-10-25  4:26       ` Eric Sunshine
  2017-10-24 23:42     ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
  2017-10-25  5:11     ` [PATCHv2 0/2] (x)diff cleanup: remove duplicate code Junio C Hamano
  2 siblings, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 23:42 UTC (permalink / raw)
  To: sbeller; +Cc: git, peff, jrnieder, l.s.r

This will turn out to be useful in a later patch.

xdl_recmatch is exported in xdiff/xutils.h, to be used by various
xdiff/*.c files, but not outside of xdiff/. This one makes it available
to the outside, too.

While at it, add documentation.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 xdiff-interface.c | 11 +++++++++++
 xdiff-interface.h | 16 ++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 018e033089..9b35af2455 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -5,6 +5,7 @@
 #include "xdiff/xdiffi.h"
 #include "xdiff/xemit.h"
 #include "xdiff/xmacros.h"
+#include "xdiff/xutils.h"
 
 struct xdiff_emit_state {
 	xdiff_emit_consume_fn consume;
@@ -296,6 +297,16 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 	}
 }
 
+unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
+{
+	return xdl_hash_record(&s, s + len, flags);
+}
+
+int xdiff_compare_lines(const char *a, long len_a, const char *b, long len_b, long flags)
+{
+	return xdl_recmatch(a, len_a, b, len_b, flags);
+}
+
 int git_xmerge_style = -1;
 
 int git_xmerge_config(const char *var, const char *value, void *cb)
diff --git a/xdiff-interface.h b/xdiff-interface.h
index 6f6ba9095d..6f5abaf8d3 100644
--- a/xdiff-interface.h
+++ b/xdiff-interface.h
@@ -29,4 +29,20 @@ extern void xdiff_clear_find_func(xdemitconf_t *xecfg);
 extern int git_xmerge_config(const char *var, const char *value, void *cb);
 extern int git_xmerge_style;
 
+/*
+ * Compare the strings l1 with l2 which are of size s1 and s2 respectively.
+ * Returns 1 if the strings are deemed equal, 0 otherwise.
+ * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
+ * are treated for the comparision.
+ */
+extern int xdiff_compare_lines(const char *a, long len_a,
+			       const char *b, long len_b, long flags);
+
+/*
+ * Returns a hash of the string s of length len.
+ * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
+ * are treated for the hash.
+ */
+extern unsigned long xdiff_hash_string(const char *s, size_t len, long flags);
+
 #endif
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/2] diff.c: get rid of duplicate implementation
  2017-10-24 23:42   ` [PATCHv2 0/2] " Stefan Beller
  2017-10-24 23:42     ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
@ 2017-10-24 23:42     ` Stefan Beller
  2017-10-25  5:11     ` [PATCHv2 0/2] (x)diff cleanup: remove duplicate code Junio C Hamano
  2 siblings, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 23:42 UTC (permalink / raw)
  To: sbeller; +Cc: git, peff, jrnieder, l.s.r

The implementations in diff.c to detect moved lines needs to compare
strings and hash strings, which is implemented in that file, as well
as in the xdiff library.

Remove the rather recent implementation in diff.c and rely on the well
exercised code in the xdiff lib.

With this change the hash used for bucketing the strings for the moved
line detection changes from FNV32 (that is provided via the hashmaps
memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
on the web[1] do not indicate that these hashes are different in
performance for readable strings.

[1] https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 diff.c | 82 ++++--------------------------------------------------------------
 1 file changed, 4 insertions(+), 78 deletions(-)

diff --git a/diff.c b/diff.c
index c4a669ffa8..e6814b9e9c 100644
--- a/diff.c
+++ b/diff.c
@@ -707,88 +707,14 @@ struct moved_entry {
 	struct moved_entry *next_line;
 };
 
-static int next_byte(const char **cp, const char **endp,
-		     const struct diff_options *diffopt)
-{
-	int retval;
-
-	if (*cp >= *endp)
-		return -1;
-
-	if (isspace(**cp)) {
-		if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) {
-			while (*cp < *endp && isspace(**cp))
-				(*cp)++;
-			/*
-			 * After skipping a couple of whitespaces,
-			 * we still have to account for one space.
-			 */
-			return (int)' ';
-		}
-
-		if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) {
-			while (*cp < *endp && isspace(**cp))
-				(*cp)++;
-			/*
-			 * return the first non-ws character via the usual
-			 * below, unless we ate all of the bytes
-			 */
-			if (*cp >= *endp)
-				return -1;
-		}
-	}
-
-	retval = (unsigned char)(**cp);
-	(*cp)++;
-	return retval;
-}
-
 static int moved_entry_cmp(const struct diff_options *diffopt,
 			   const struct moved_entry *a,
 			   const struct moved_entry *b,
 			   const void *keydata)
 {
-	const char *ap = a->es->line, *ae = a->es->line + a->es->len;
-	const char *bp = b->es->line, *be = b->es->line + b->es->len;
-
-	if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS))
-		return a->es->len != b->es->len  || memcmp(ap, bp, a->es->len);
-
-	if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_AT_EOL)) {
-		while (ae > ap && isspace(ae[-1]))
-			ae--;
-		while (be > bp && isspace(be[-1]))
-			be--;
-	}
-
-	while (1) {
-		int ca, cb;
-		ca = next_byte(&ap, &ae, diffopt);
-		cb = next_byte(&bp, &be, diffopt);
-		if (ca != cb)
-			return 1;
-		if (ca < 0)
-			return 0;
-	}
-}
-
-static unsigned get_string_hash(struct emitted_diff_symbol *es, struct diff_options *o)
-{
-	if (o->xdl_opts & XDF_WHITESPACE_FLAGS) {
-		static struct strbuf sb = STRBUF_INIT;
-		const char *ap = es->line, *ae = es->line + es->len;
-		int c;
-
-		strbuf_reset(&sb);
-		while (ae > ap && isspace(ae[-1]))
-			ae--;
-		while ((c = next_byte(&ap, &ae, o)) >= 0)
-			strbuf_addch(&sb, c);
-
-		return memhash(sb.buf, sb.len);
-	} else {
-		return memhash(es->line, es->len);
-	}
+	return !xdiff_compare_lines(a->es->line, a->es->len,
+				    b->es->line, b->es->len,
+				    diffopt->xdl_opts);
 }
 
 static struct moved_entry *prepare_entry(struct diff_options *o,
@@ -797,7 +723,7 @@ static struct moved_entry *prepare_entry(struct diff_options *o,
 	struct moved_entry *ret = xmalloc(sizeof(*ret));
 	struct emitted_diff_symbol *l = &o->emitted_symbols->buf[line_no];
 
-	ret->ent.hash = get_string_hash(l, o);
+	ret->ent.hash = xdiff_hash_string(l->line, l->len, o->xdl_opts);
 	ret->es = l;
 	ret->next_line = NULL;
 
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 1/5] fnv: migrate code from hashmap to fnv
  2017-10-24 19:02 ` [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
  2017-10-24 23:42   ` [PATCHv2 0/2] " Stefan Beller
@ 2017-10-24 23:45   ` Stefan Beller
  1 sibling, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-24 23:45 UTC (permalink / raw)
  To: sbeller; +Cc: git, peff, l.s.r

The functions regarding the Fowler–Noll–Vo hash function are put in a
separate compilation unit, as it is logically different from the hashmap
functionality.

Signed-off-by: Stefan Beller <sbeller@google.com>
---

This may still be an interesting cleanup on its own.

Thanks,
Stefan

 Makefile                |  1 +
 attr.c                  |  1 +
 builtin/fast-export.c   |  1 +
 diff.c                  |  1 +
 fnv.c                   | 64 +++++++++++++++++++++++++++++++++++++++++++++++++
 fnv.h                   | 20 ++++++++++++++++
 hashmap.c               | 64 +------------------------------------------------
 hashmap.h               | 15 ------------
 merge-recursive.c       |  1 +
 remote.c                |  1 +
 submodule-config.c      |  1 +
 t/helper/test-hashmap.c |  1 +
 12 files changed, 93 insertions(+), 78 deletions(-)
 create mode 100644 fnv.c
 create mode 100644 fnv.h

diff --git a/Makefile b/Makefile
index cd75985991..8e1c5988f3 100644
--- a/Makefile
+++ b/Makefile
@@ -793,6 +793,7 @@ LIB_OBJS += ewah/ewah_io.o
 LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
 LIB_OBJS += fetch-pack.o
+LIB_OBJS += fnv.o
 LIB_OBJS += fsck.o
 LIB_OBJS += gettext.o
 LIB_OBJS += gpg-interface.o
diff --git a/attr.c b/attr.c
index dfc3a558d8..2e4217c4f1 100644
--- a/attr.c
+++ b/attr.c
@@ -16,6 +16,7 @@
 #include "utf8.h"
 #include "quote.h"
 #include "thread-utils.h"
+#include "fnv.h"
 
 const char git_attr__true[] = "(builtin)true";
 const char git_attr__false[] = "\0(builtin)false";
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 2fb60d6d48..62f4010510 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -21,6 +21,7 @@
 #include "quote.h"
 #include "remote.h"
 #include "blob.h"
+#include "fnv.h"
 
 static const char *fast_export_usage[] = {
 	N_("git fast-export [rev-list-opts]"),
diff --git a/diff.c b/diff.c
index c4a669ffa8..a23f4521fb 100644
--- a/diff.c
+++ b/diff.c
@@ -22,6 +22,7 @@
 #include "argv-array.h"
 #include "graph.h"
 #include "packfile.h"
+#include "fnv.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
diff --git a/fnv.c b/fnv.c
new file mode 100644
index 0000000000..b4cbf39f0a
--- /dev/null
+++ b/fnv.c
@@ -0,0 +1,64 @@
+#include "fnv.h"
+
+#define FNV32_BASE ((unsigned int) 0x811c9dc5)
+#define FNV32_PRIME ((unsigned int) 0x01000193)
+
+unsigned int strhash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++))
+		hash = (hash * FNV32_PRIME) ^ c;
+	return hash;
+}
+
+unsigned int strihash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++)) {
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memhash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char *) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memihash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char *) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+/*
+ * Incoporate another chunk of data into a memihash
+ * computation.
+ */
+unsigned int memihash_cont(unsigned int hash_seed, const void *buf, size_t len)
+{
+	unsigned int hash = hash_seed;
+	unsigned char *ucbuf = (unsigned char *) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
diff --git a/fnv.h b/fnv.h
new file mode 100644
index 0000000000..b425c85c66
--- /dev/null
+++ b/fnv.h
@@ -0,0 +1,20 @@
+#ifndef FNV_H
+#define FNV_H
+
+#include <stdlib.h>
+/*
+ * Ready-to-use hash functions for strings, using the FNV-1 algorithm (see
+ * http://www.isthe.com/chongo/tech/comp/fnv).
+ * `strhash` and `strihash` take 0-terminated strings, while `memhash` and
+ * `memihash` operate on arbitrary-length memory.
+ * `strihash` and `memihash` are case insensitive versions.
+ * `memihash_cont` is a variant of `memihash` that allows a computation to be
+ * continued with another chunk of data.
+ */
+extern unsigned int strhash(const char *buf);
+extern unsigned int strihash(const char *buf);
+extern unsigned int memhash(const void *buf, size_t len);
+extern unsigned int memihash(const void *buf, size_t len);
+extern unsigned int memihash_cont(unsigned int hash_seed, const void *buf, size_t len);
+
+#endif
diff --git a/hashmap.c b/hashmap.c
index d42f01ff5a..1605edbbc3 100644
--- a/hashmap.c
+++ b/hashmap.c
@@ -3,69 +3,7 @@
  */
 #include "cache.h"
 #include "hashmap.h"
-
-#define FNV32_BASE ((unsigned int) 0x811c9dc5)
-#define FNV32_PRIME ((unsigned int) 0x01000193)
-
-unsigned int strhash(const char *str)
-{
-	unsigned int c, hash = FNV32_BASE;
-	while ((c = (unsigned char) *str++))
-		hash = (hash * FNV32_PRIME) ^ c;
-	return hash;
-}
-
-unsigned int strihash(const char *str)
-{
-	unsigned int c, hash = FNV32_BASE;
-	while ((c = (unsigned char) *str++)) {
-		if (c >= 'a' && c <= 'z')
-			c -= 'a' - 'A';
-		hash = (hash * FNV32_PRIME) ^ c;
-	}
-	return hash;
-}
-
-unsigned int memhash(const void *buf, size_t len)
-{
-	unsigned int hash = FNV32_BASE;
-	unsigned char *ucbuf = (unsigned char *) buf;
-	while (len--) {
-		unsigned int c = *ucbuf++;
-		hash = (hash * FNV32_PRIME) ^ c;
-	}
-	return hash;
-}
-
-unsigned int memihash(const void *buf, size_t len)
-{
-	unsigned int hash = FNV32_BASE;
-	unsigned char *ucbuf = (unsigned char *) buf;
-	while (len--) {
-		unsigned int c = *ucbuf++;
-		if (c >= 'a' && c <= 'z')
-			c -= 'a' - 'A';
-		hash = (hash * FNV32_PRIME) ^ c;
-	}
-	return hash;
-}
-
-/*
- * Incoporate another chunk of data into a memihash
- * computation.
- */
-unsigned int memihash_cont(unsigned int hash_seed, const void *buf, size_t len)
-{
-	unsigned int hash = hash_seed;
-	unsigned char *ucbuf = (unsigned char *) buf;
-	while (len--) {
-		unsigned int c = *ucbuf++;
-		if (c >= 'a' && c <= 'z')
-			c -= 'a' - 'A';
-		hash = (hash * FNV32_PRIME) ^ c;
-	}
-	return hash;
-}
+#include "fnv.h"
 
 #define HASHMAP_INITIAL_SIZE 64
 /* grow / shrink by 2^2 */
diff --git a/hashmap.h b/hashmap.h
index 7cb29a6aed..96176f7d8c 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -97,21 +97,6 @@
  * }
  */
 
-/*
- * Ready-to-use hash functions for strings, using the FNV-1 algorithm (see
- * http://www.isthe.com/chongo/tech/comp/fnv).
- * `strhash` and `strihash` take 0-terminated strings, while `memhash` and
- * `memihash` operate on arbitrary-length memory.
- * `strihash` and `memihash` are case insensitive versions.
- * `memihash_cont` is a variant of `memihash` that allows a computation to be
- * continued with another chunk of data.
- */
-extern unsigned int strhash(const char *buf);
-extern unsigned int strihash(const char *buf);
-extern unsigned int memhash(const void *buf, size_t len);
-extern unsigned int memihash(const void *buf, size_t len);
-extern unsigned int memihash_cont(unsigned int hash_seed, const void *buf, size_t len);
-
 /*
  * Converts a cryptographic hash (e.g. SHA-1) into an int-sized hash code
  * for use in hash tables. Cryptographic hashes are supposed to have
diff --git a/merge-recursive.c b/merge-recursive.c
index 1d3f8f0d22..d1fb2f5f3d 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -22,6 +22,7 @@
 #include "attr.h"
 #include "merge-recursive.h"
 #include "dir.h"
+#include "fnv.h"
 #include "submodule.h"
 
 struct path_hashmap_entry {
diff --git a/remote.c b/remote.c
index b220f0dfc6..24957ba32c 100644
--- a/remote.c
+++ b/remote.c
@@ -10,6 +10,7 @@
 #include "string-list.h"
 #include "mergesort.h"
 #include "argv-array.h"
+#include "fnv.h"
 
 enum map_direction { FROM_SRC, FROM_DST };
 
diff --git a/submodule-config.c b/submodule-config.c
index 2aa8a1747f..432423f876 100644
--- a/submodule-config.c
+++ b/submodule-config.c
@@ -5,6 +5,7 @@
 #include "submodule.h"
 #include "strbuf.h"
 #include "parse-options.h"
+#include "fnv.h"
 
 /*
  * submodule cache lookup structure
diff --git a/t/helper/test-hashmap.c b/t/helper/test-hashmap.c
index 1145d51671..b5ac97886f 100644
--- a/t/helper/test-hashmap.c
+++ b/t/helper/test-hashmap.c
@@ -1,5 +1,6 @@
 #include "git-compat-util.h"
 #include "hashmap.h"
+#include "fnv.h"
 
 struct test_entry
 {
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] xdiff-interface: export comparing and hashing strings
  2017-10-24 23:42     ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
@ 2017-10-25  4:26       ` Eric Sunshine
  0 siblings, 0 replies; 31+ messages in thread
From: Eric Sunshine @ 2017-10-25  4:26 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Git List, Jeff King, Jonathan Nieder, René Scharfe

On Tue, Oct 24, 2017 at 7:42 PM, Stefan Beller <sbeller@google.com> wrote:
> This will turn out to be useful in a later patch.
>
> xdl_recmatch is exported in xdiff/xutils.h, to be used by various
> xdiff/*.c files, but not outside of xdiff/. This one makes it available
> to the outside, too.
>
> While at it, add documentation.
>
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
> diff --git a/xdiff-interface.h b/xdiff-interface.h
> @@ -29,4 +29,20 @@ extern void xdiff_clear_find_func(xdemitconf_t *xecfg);
> +/*
> + * Compare the strings l1 with l2 which are of size s1 and s2 respectively.
> + * Returns 1 if the strings are deemed equal, 0 otherwise.
> + * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
> + * are treated for the comparision.
> + */
> +extern int xdiff_compare_lines(const char *a, long len_a,
> +                              const char *b, long len_b, long flags);

The comment block talks about 'l1', 'l2', 's1', and 's2', but the
declaration uses 'a', 'b, 'len_a', 'len_b'. Confusing.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCHv2 0/2] (x)diff cleanup: remove duplicate code
  2017-10-24 23:42   ` [PATCHv2 0/2] " Stefan Beller
  2017-10-24 23:42     ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
  2017-10-24 23:42     ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
@ 2017-10-25  5:11     ` Junio C Hamano
  2017-10-25 18:47       ` Stefan Beller
  2017-10-25 18:49       ` [PATCHv3 " Stefan Beller
  2 siblings, 2 replies; 31+ messages in thread
From: Junio C Hamano @ 2017-10-25  5:11 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, peff, jrnieder, l.s.r

Stefan Beller <sbeller@google.com> writes:

> v2:
> * I realized that we don't have to change the hashing function of xdiff;
>   we can rather change the hashing function for the move detection,
>   which is less fundamental.
>   (That way I can shrink the series down to 2 patches)
> * commented and renamed function parameters in the exposed xdiff functions.
> * applies on top of jk/diff-color-moved-fix.

Yes, by reusing the line hashing and comparison from xdiff/ we can
ensure that we will use consistent comparison function, and the
thing we need to focus on will become how correctly the caller uses
the xdiff interface.  This looks much better than the previous one.

Eric's comment on the function parameters is right.  We keep them in
sync with the naming convention of xdiff/ as long as they are still
part of xdiff layer, and the convention there is that the lines
being compared are l1[] and l2[] whose lengths are s1 and s2, if I
am not mistaken (well, I am not, as I just touched the function
there during my lunch break ;-).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCHv2 0/2] (x)diff cleanup: remove duplicate code
  2017-10-25  5:11     ` [PATCHv2 0/2] (x)diff cleanup: remove duplicate code Junio C Hamano
@ 2017-10-25 18:47       ` Stefan Beller
  2017-10-25 18:49       ` [PATCHv3 " Stefan Beller
  1 sibling, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-25 18:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jeff King, Jonathan Nieder, René Scharfe

On Tue, Oct 24, 2017 at 10:11 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> v2:
>> * I realized that we don't have to change the hashing function of xdiff;
>>   we can rather change the hashing function for the move detection,
>>   which is less fundamental.
>>   (That way I can shrink the series down to 2 patches)
>> * commented and renamed function parameters in the exposed xdiff functions.
>> * applies on top of jk/diff-color-moved-fix.
>
> Yes, by reusing the line hashing and comparison from xdiff/ we can
> ensure that we will use consistent comparison function, and the
> thing we need to focus on will become how correctly the caller uses
> the xdiff interface.  This looks much better than the previous one.
>
> Eric's comment on the function parameters is right.  We keep them in
> sync with the naming convention of xdiff/ as long as they are still
> part of xdiff layer, and the convention there is that the lines
> being compared are l1[] and l2[] whose lengths are s1 and s2, if I
> am not mistaken (well, I am not, as I just touched the function
> there during my lunch break ;-).
>

Yes, I am torn on the naming as Rene commented on it and did not like
the (l1, s1) and rather would see a (a, alen) / (b, blen) approach.
And I thought that was a good idea initially as that patch is explicitly about
exposing the internals of xdiff to the world. So when the world copes better
with (a, alen), then the function added in xdiff-interface.c should take that
parameter as to not confuse the outside world. And one could argue
that xdiff-interface is not part of the xdiff layer any more as that is the
glue-gap to the rest of Git.

However I agree that we may want to keep l1[], s1 at that place, as the
rest of the xdiff interface looks like it is kept in line with xdiff.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCHv3 0/2] (x)diff cleanup: remove duplicate code
  2017-10-25  5:11     ` [PATCHv2 0/2] (x)diff cleanup: remove duplicate code Junio C Hamano
  2017-10-25 18:47       ` Stefan Beller
@ 2017-10-25 18:49       ` Stefan Beller
  2017-10-25 18:49         ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
  2017-10-25 18:49         ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
  1 sibling, 2 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-25 18:49 UTC (permalink / raw)
  To: gitster; +Cc: git, jrnieder, l.s.r, peff, sbeller

V3:
* changed parameter names back to xdiff standard

v2:
* I realized that we don't have to change the hashing function of xdiff;
  we can rather change the hashing function for the move detection,
  which is less fundamental.
  (That way I can shrink the series down to 2 patches)
* commented and renamed function parameters in the exposed xdiff functions.
* applies on top of jk/diff-color-moved-fix.

Thanks,
Stefan

v1:
Junio wrote:

>  * I was hoping that the next_byte() and string_hash() thing, once
>    they are cleaned up, will eventually be shared with the xdiff/
>    code at the lower layer, which needs to do pretty much the same
>    in order to implement various whitespace ignoring options.  I am
>    not sure how well the approach taken by the WIP patch meshes with
>    the needs of the lower layer.

This series does exactly this; although I chose to reuse the code in
xdiff/xutils.c instead of the new fancy next_byte/string_hash, as that
code has seen more exercise already (hence I assume it has fewer bugs
if any as well as its performance implications are well understood).

However to do so, we need to pollute xdiff/xutils.c and include
hashmap.h there (which also requires cache.h as that header has
an inline function using BUG()), which I find a bit unfortunate but
worth the trade off of using better tested code.

Thanks,
Stefan

Stefan Beller (2):
  xdiff-interface: export comparing and hashing strings
  diff.c: get rid of duplicate implementation

 diff.c            | 82 +++----------------------------------------------------
 xdiff-interface.c | 12 ++++++++
 xdiff-interface.h | 16 +++++++++++
 3 files changed, 32 insertions(+), 78 deletions(-)

-- 
2.15.0.rc2.6.g953226eb5f

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/2] xdiff-interface: export comparing and hashing strings
  2017-10-25 18:49       ` [PATCHv3 " Stefan Beller
@ 2017-10-25 18:49         ` Stefan Beller
  2017-10-26 17:12           ` René Scharfe
  2017-10-25 18:49         ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
  1 sibling, 1 reply; 31+ messages in thread
From: Stefan Beller @ 2017-10-25 18:49 UTC (permalink / raw)
  To: gitster; +Cc: git, jrnieder, l.s.r, peff, sbeller

This will turn out to be useful in a later patch.

xdl_recmatch is exported in xdiff/xutils.h, to be used by various
xdiff/*.c files, but not outside of xdiff/. This one makes it available
to the outside, too.

While at it, add documentation.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 xdiff-interface.c | 12 ++++++++++++
 xdiff-interface.h | 16 ++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/xdiff-interface.c b/xdiff-interface.c
index 018e033089..770e1f7f81 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -5,6 +5,7 @@
 #include "xdiff/xdiffi.h"
 #include "xdiff/xemit.h"
 #include "xdiff/xmacros.h"
+#include "xdiff/xutils.h"
 
 struct xdiff_emit_state {
 	xdiff_emit_consume_fn consume;
@@ -296,6 +297,17 @@ void xdiff_clear_find_func(xdemitconf_t *xecfg)
 	}
 }
 
+unsigned long xdiff_hash_string(const char *s, size_t len, long flags)
+{
+	return xdl_hash_record(&s, s + len, flags);
+}
+
+int xdiff_compare_lines(const char *l1, long s1,
+			const char *l2, long s2, long flags)
+{
+	return xdl_recmatch(l1, s1, l2, s2, flags);
+}
+
 int git_xmerge_style = -1;
 
 int git_xmerge_config(const char *var, const char *value, void *cb)
diff --git a/xdiff-interface.h b/xdiff-interface.h
index 6f6ba9095d..135fc05d72 100644
--- a/xdiff-interface.h
+++ b/xdiff-interface.h
@@ -29,4 +29,20 @@ extern void xdiff_clear_find_func(xdemitconf_t *xecfg);
 extern int git_xmerge_config(const char *var, const char *value, void *cb);
 extern int git_xmerge_style;
 
+/*
+ * Compare the strings l1 with l2 which are of size s1 and s2 respectively.
+ * Returns 1 if the strings are deemed equal, 0 otherwise.
+ * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
+ * are treated for the comparision.
+ */
+extern int xdiff_compare_lines(const char *l1, long s1,
+			       const char *l2, long s2, long flags);
+
+/*
+ * Returns a hash of the string s of length len.
+ * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
+ * are treated for the hash.
+ */
+extern unsigned long xdiff_hash_string(const char *s, size_t len, long flags);
+
 #endif
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/2] diff.c: get rid of duplicate implementation
  2017-10-25 18:49       ` [PATCHv3 " Stefan Beller
  2017-10-25 18:49         ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
@ 2017-10-25 18:49         ` Stefan Beller
  2017-10-26  2:23           ` Junio C Hamano
  2017-10-26 17:43           ` René Scharfe
  1 sibling, 2 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-25 18:49 UTC (permalink / raw)
  To: gitster; +Cc: git, jrnieder, l.s.r, peff, sbeller

The implementations in diff.c to detect moved lines needs to compare
strings and hash strings, which is implemented in that file, as well
as in the xdiff library.

Remove the rather recent implementation in diff.c and rely on the well
exercised code in the xdiff lib.

With this change the hash used for bucketing the strings for the moved
line detection changes from FNV32 (that is provided via the hashmaps
memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
on the web[1] do not indicate that these hashes are different in
performance for readable strings.

[1] https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 diff.c | 82 ++++--------------------------------------------------------------
 1 file changed, 4 insertions(+), 78 deletions(-)

diff --git a/diff.c b/diff.c
index c4a669ffa8..e6814b9e9c 100644
--- a/diff.c
+++ b/diff.c
@@ -707,88 +707,14 @@ struct moved_entry {
 	struct moved_entry *next_line;
 };
 
-static int next_byte(const char **cp, const char **endp,
-		     const struct diff_options *diffopt)
-{
-	int retval;
-
-	if (*cp >= *endp)
-		return -1;
-
-	if (isspace(**cp)) {
-		if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_CHANGE)) {
-			while (*cp < *endp && isspace(**cp))
-				(*cp)++;
-			/*
-			 * After skipping a couple of whitespaces,
-			 * we still have to account for one space.
-			 */
-			return (int)' ';
-		}
-
-		if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE)) {
-			while (*cp < *endp && isspace(**cp))
-				(*cp)++;
-			/*
-			 * return the first non-ws character via the usual
-			 * below, unless we ate all of the bytes
-			 */
-			if (*cp >= *endp)
-				return -1;
-		}
-	}
-
-	retval = (unsigned char)(**cp);
-	(*cp)++;
-	return retval;
-}
-
 static int moved_entry_cmp(const struct diff_options *diffopt,
 			   const struct moved_entry *a,
 			   const struct moved_entry *b,
 			   const void *keydata)
 {
-	const char *ap = a->es->line, *ae = a->es->line + a->es->len;
-	const char *bp = b->es->line, *be = b->es->line + b->es->len;
-
-	if (!(diffopt->xdl_opts & XDF_WHITESPACE_FLAGS))
-		return a->es->len != b->es->len  || memcmp(ap, bp, a->es->len);
-
-	if (DIFF_XDL_TST(diffopt, IGNORE_WHITESPACE_AT_EOL)) {
-		while (ae > ap && isspace(ae[-1]))
-			ae--;
-		while (be > bp && isspace(be[-1]))
-			be--;
-	}
-
-	while (1) {
-		int ca, cb;
-		ca = next_byte(&ap, &ae, diffopt);
-		cb = next_byte(&bp, &be, diffopt);
-		if (ca != cb)
-			return 1;
-		if (ca < 0)
-			return 0;
-	}
-}
-
-static unsigned get_string_hash(struct emitted_diff_symbol *es, struct diff_options *o)
-{
-	if (o->xdl_opts & XDF_WHITESPACE_FLAGS) {
-		static struct strbuf sb = STRBUF_INIT;
-		const char *ap = es->line, *ae = es->line + es->len;
-		int c;
-
-		strbuf_reset(&sb);
-		while (ae > ap && isspace(ae[-1]))
-			ae--;
-		while ((c = next_byte(&ap, &ae, o)) >= 0)
-			strbuf_addch(&sb, c);
-
-		return memhash(sb.buf, sb.len);
-	} else {
-		return memhash(es->line, es->len);
-	}
+	return !xdiff_compare_lines(a->es->line, a->es->len,
+				    b->es->line, b->es->len,
+				    diffopt->xdl_opts);
 }
 
 static struct moved_entry *prepare_entry(struct diff_options *o,
@@ -797,7 +723,7 @@ static struct moved_entry *prepare_entry(struct diff_options *o,
 	struct moved_entry *ret = xmalloc(sizeof(*ret));
 	struct emitted_diff_symbol *l = &o->emitted_symbols->buf[line_no];
 
-	ret->ent.hash = get_string_hash(l, o);
+	ret->ent.hash = xdiff_hash_string(l->line, l->len, o->xdl_opts);
 	ret->es = l;
 	ret->next_line = NULL;
 
-- 
2.15.0.rc2.6.g953226eb5f


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
  2017-10-25 18:49         ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
@ 2017-10-26  2:23           ` Junio C Hamano
  2017-10-26 17:43           ` René Scharfe
  1 sibling, 0 replies; 31+ messages in thread
From: Junio C Hamano @ 2017-10-26  2:23 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, jrnieder, l.s.r, peff

Stefan Beller <sbeller@google.com> writes:

> The implementations in diff.c to detect moved lines needs to compare
> strings and hash strings, which is implemented in that file, as well
> as in the xdiff library.
>
> Remove the rather recent implementation in diff.c and rely on the well
> exercised code in the xdiff lib.
> ...
>  static int moved_entry_cmp(const struct diff_options *diffopt,
>  			   const struct moved_entry *a,
>  			   const struct moved_entry *b,
>  			   const void *keydata)
>  {
> -	const char *ap = a->es->line, *ae = a->es->line + a->es->len;
> -	const char *bp = b->es->line, *be = b->es->line + b->es->len;
> - ...
> +	return !xdiff_compare_lines(a->es->line, a->es->len,
> +				    b->es->line, b->es->len,
> +				    diffopt->xdl_opts);
>  }

OK, xdiff's xdl_recmatch() takes counted strings, and line[len] is
one byte beyond the end of the line (where LF typically sits), which
is the same convention as how "emitted_symbol" represents a line, so
not just the implementation replaced with a known-working one, but
also the code calls the helper with correct calling convention ;-)

> -	ret->ent.hash = get_string_hash(l, o);
> +	ret->ent.hash = xdiff_hash_string(l->line, l->len, o->xdl_opts);

Likewise.  Looks good.

Will queue.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/4] xdiff-interface: export comparing and hashing strings
  2017-10-24 20:42     ` Stefan Beller
@ 2017-10-26 17:03       ` René Scharfe
  0 siblings, 0 replies; 31+ messages in thread
From: René Scharfe @ 2017-10-26 17:03 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

Am 24.10.2017 um 22:42 schrieb Stefan Beller:
> On Tue, Oct 24, 2017 at 1:23 PM, René Scharfe <l.s.r@web.de> wrote:
> 
>> xdl_recmatch() is already exported; why not use it without this
>> wrapper?
> 
> It is exported in xdiff/xutils.h, to be used by various xdiff/*.c files, but
> not outside of xdiff/. This one makes it available to the outside, too.

Ah, right, somehow I mixed that up with xdiff/xdiff.h, which is already
included by two builtins.

René

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] xdiff-interface: export comparing and hashing strings
  2017-10-25 18:49         ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
@ 2017-10-26 17:12           ` René Scharfe
  2017-10-27  7:12             ` Junio C Hamano
  0 siblings, 1 reply; 31+ messages in thread
From: René Scharfe @ 2017-10-26 17:12 UTC (permalink / raw)
  To: Stefan Beller, gitster; +Cc: git, jrnieder, peff

Am 25.10.2017 um 20:49 schrieb Stefan Beller:
> +/*
> + * Compare the strings l1 with l2 which are of size s1 and s2 respectively.
> + * Returns 1 if the strings are deemed equal, 0 otherwise.
> + * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
> + * are treated for the comparision.
> + */
> +extern int xdiff_compare_lines(const char *l1, long s1,
> +			       const char *l2, long s2, long flags);

With the added comment it's OK here.

Still, I find the tendency in libxdiff to use the shortest possible
variable names to be hard on the eyes.  That math-like notation may have
its place in that external library, but I think we should be careful
lest it spreads.

René

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
  2017-10-25 18:49         ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
  2017-10-26  2:23           ` Junio C Hamano
@ 2017-10-26 17:43           ` René Scharfe
  2017-10-30 17:59             ` Jeff King
  1 sibling, 1 reply; 31+ messages in thread
From: René Scharfe @ 2017-10-26 17:43 UTC (permalink / raw)
  To: Stefan Beller, gitster; +Cc: git, jrnieder, peff

Am 25.10.2017 um 20:49 schrieb Stefan Beller:
> The implementations in diff.c to detect moved lines needs to compare
> strings and hash strings, which is implemented in that file, as well
> as in the xdiff library.
> 
> Remove the rather recent implementation in diff.c and rely on the well
> exercised code in the xdiff lib.
> 
> With this change the hash used for bucketing the strings for the moved
> line detection changes from FNV32 (that is provided via the hashmaps
> memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
> on the web[1] do not indicate that these hashes are different in
> performance for readable strings.
> 
> [1] https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Awesome comparison!  It calls the variant used in libxdiff DJB2a (which
uses xor to mix in the new char) instead of DJB2 (which uses addition).

There's also https://www.strchr.com/hash_functions, which lists DJB2
as Bernstein.  Both functions rank somewhere in the middle of that list.

> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
>   diff.c | 82 ++++--------------------------------------------------------------
>   1 file changed, 4 insertions(+), 78 deletions(-)

Very nice.

René

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] xdiff-interface: export comparing and hashing strings
  2017-10-26 17:12           ` René Scharfe
@ 2017-10-27  7:12             ` Junio C Hamano
  2017-10-27 17:15               ` Stefan Beller
  0 siblings, 1 reply; 31+ messages in thread
From: Junio C Hamano @ 2017-10-27  7:12 UTC (permalink / raw)
  To: Stefan Beller, René Scharfe; +Cc: git, jrnieder, peff

René Scharfe <l.s.r@web.de> writes:

> Am 25.10.2017 um 20:49 schrieb Stefan Beller:
>> +/*
>> + * Compare the strings l1 with l2 which are of size s1 and s2 respectively.
>> + * Returns 1 if the strings are deemed equal, 0 otherwise.
>> + * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
>> + * are treated for the comparision.
>> + */
>> +extern int xdiff_compare_lines(const char *l1, long s1,
>> +			       const char *l2, long s2, long flags);
>
> With the added comment it's OK here.
>
> Still, I find the tendency in libxdiff to use the shortest possible
> variable names to be hard on the eyes.  That math-like notation may have
> its place in that external library, but I think we should be careful
> lest it spreads.

Yeah, I tend to agree.  The xdiff-interface is at the (surprise!)
interface layer, so we could make it follow the style of either one,
but we seem to have picked up the convention of the lower layer,
so...

By the way, I was looking at the code around this area while
reviewing the cr-at-eol thing, and noticed a couple of things:

 * combine-diff.c special cases only IGNORE_WHITESPACE and
   IGNORE_WHITESPACE_CHANGE but no other IGNORE_WHITESPACE things; I
   have a suspicion that this is not because IGNORE_WHITESPACE_AT_EOL
   does not have to special cased by the codepath, but only because
   the combine-diff.c refactoring predates the introduction of
   ws-at-eol thing?

   The machinery uses its own match_string_spaces() helper; it may
   make sense to use the same xdiff_compare_lines() in its callers
   and get rid of this helper function.

 * diff_setup_done() sets DIFF_FROM_CONTENTS when any of the
   IGNORE_WHITESPACE bits is true, to allow "git diff -q
   --ignore-something" would do the right thing.  We do not however
   give a similar special case to XDF_IGNORE_BLANK_LINES.

   $ echo >>Makefile && git diff $option --ignore-blank-lines Makefile

   exits with 1 when option=--exit-code and it exits with 0 when
   option=-q

For now I'll leve these as #leftoverbits, but I may come back to the
latter soonish.  I won't come back to the former until Stefan's
refactor hits 'master', though.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/2] xdiff-interface: export comparing and hashing strings
  2017-10-27  7:12             ` Junio C Hamano
@ 2017-10-27 17:15               ` Stefan Beller
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Beller @ 2017-10-27 17:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: René Scharfe, git, Jonathan Nieder, Jeff King

On Fri, Oct 27, 2017 at 12:12 AM, Junio C Hamano <gitster@pobox.com> wrote:
> René Scharfe <l.s.r@web.de> writes:
>
>> Am 25.10.2017 um 20:49 schrieb Stefan Beller:
>>> +/*
>>> + * Compare the strings l1 with l2 which are of size s1 and s2 respectively.
>>> + * Returns 1 if the strings are deemed equal, 0 otherwise.
>>> + * The `flags` given as XDF_WHITESPACE_FLAGS determine how white spaces
>>> + * are treated for the comparision.
>>> + */
>>> +extern int xdiff_compare_lines(const char *l1, long s1,
>>> +                           const char *l2, long s2, long flags);
>>
>> With the added comment it's OK here.
>>
>> Still, I find the tendency in libxdiff to use the shortest possible
>> variable names to be hard on the eyes.  That math-like notation may have
>> its place in that external library, but I think we should be careful
>> lest it spreads.
>
> Yeah, I tend to agree.  The xdiff-interface is at the (surprise!)
> interface layer, so we could make it follow the style of either one,
> but we seem to have picked up the convention of the lower layer,
> so...
>
> By the way, I was looking at the code around this area while
> reviewing the cr-at-eol thing, and noticed a couple of things:
>
>
>  * combine-diff.c special cases only IGNORE_WHITESPACE and
>    IGNORE_WHITESPACE_CHANGE but no other IGNORE_WHITESPACE things; I
>    have a suspicion that this is not because IGNORE_WHITESPACE_AT_EOL
>    does not have to special cased by the codepath, but only because
>    the combine-diff.c refactoring predates the introduction of
>    ws-at-eol thing?

I wonder how much overlap between the IGNORE_WHITESPACE_AT_EOL
and CRLF-AT-EOL is (maybe these can be combined, as per the root of
this discussion)

>    The machinery uses its own match_string_spaces() helper; it may
>    make sense to use the same xdiff_compare_lines() in its callers
>    and get rid of this helper function.

Speaking of xdiff_compare_lines, it serves the special purpose of the
move detection as well as its internal use cases, but we may need to
change its interface/implementation in the future, to align it with strcmp,
currently the compare function only returns equality, not an order.

>  * diff_setup_done() sets DIFF_FROM_CONTENTS when any of the
>    IGNORE_WHITESPACE bits is true, to allow "git diff -q
>    --ignore-something" would do the right thing.  We do not however
>    give a similar special case to XDF_IGNORE_BLANK_LINES.
>
>    $ echo >>Makefile && git diff $option --ignore-blank-lines Makefile
>
>    exits with 1 when option=--exit-code and it exits with 0 when
>    option=-q
>
>
> For now I'll leve these as #leftoverbits, but I may come back to the
> latter soonish.  I won't come back to the former until Stefan's
> refactor hits 'master', though.

which is presumably after the 2.15 release.

To tack on the #leftoverbits: The --color-moved detection doesn't
pay attention to XDF_IGNORE_BLANK_LINES (which is tricky as
it is on the per-line layer. If we want to ignore spurious blank lines,
we'd have to check for this flag in diff.c in mark_color_as_moved(..)
in the block
    /* Check any potential block runs, advance each or nullify */

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
  2017-10-26 17:43           ` René Scharfe
@ 2017-10-30 17:59             ` Jeff King
  2017-10-30 19:07               ` Jeff King
  0 siblings, 1 reply; 31+ messages in thread
From: Jeff King @ 2017-10-30 17:59 UTC (permalink / raw)
  To: René Scharfe; +Cc: Stefan Beller, gitster, git, jrnieder

On Thu, Oct 26, 2017 at 07:43:19PM +0200, René Scharfe wrote:

> Am 25.10.2017 um 20:49 schrieb Stefan Beller:
> > The implementations in diff.c to detect moved lines needs to compare
> > strings and hash strings, which is implemented in that file, as well
> > as in the xdiff library.
> > 
> > Remove the rather recent implementation in diff.c and rely on the well
> > exercised code in the xdiff lib.
> > 
> > With this change the hash used for bucketing the strings for the moved
> > line detection changes from FNV32 (that is provided via the hashmaps
> > memhash) to DJB2 (which is used internally in xdiff).  Benchmarks found
> > on the web[1] do not indicate that these hashes are different in
> > performance for readable strings.
> > 
> > [1] https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
> 
> Awesome comparison!  It calls the variant used in libxdiff DJB2a (which
> uses xor to mix in the new char) instead of DJB2 (which uses addition).
> 
> There's also https://www.strchr.com/hash_functions, which lists DJB2
> as Bernstein.  Both functions rank somewhere in the middle of that list.

FWIW, I did some experiments with Murmur3 and SipHash a while back, but
I don't think I came up with anything faster than the existing code.
OTOH, moving to SipHash gives us the ability to randomize the hashes,
which can resist some DoS attacks (although as I've said before,
computing arbitrary diffs for untrusted strangers is pretty much a
DoS-in-a-box).

Anyway, I rebased them and ran p4000[1], with does show some results:

Test                                  origin            jk/xdl-murmur3-wip       jk/xdl-siphash-wip    
-------------------------------------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.05(0.05+0.00)   0.05(0.05+0.00) +0.0%    0.05(0.05+0.00) +0.0% 
4000.2: log --raw -3000 (tree-only)   0.30(0.28+0.02)   0.30(0.28+0.01) +0.0%    0.30(0.28+0.02) +0.0% 
4000.3: log -p -3000 (Myers)          2.06(1.98+0.08)   1.90(1.85+0.05) -7.8%    2.32(2.25+0.07) +12.6%
4000.4: log -p -3000 --histogram      2.50(2.42+0.07)   2.25(2.21+0.04) -10.0%   2.70(2.62+0.08) +8.0% 
4000.5: log -p -3000 --patience       2.58(2.52+0.06)   2.47(2.42+0.04) -4.3%    2.86(2.77+0.08) +10.9%

So it looks like murmur3 is indeed a little faster. SipHash is slower,
which is too bad (because the randomization would be nice). I _think_
back then I compared to XDL_FAST_HASH, which was a little faster even
than murmur3 (but generated too many collisions, which led to bad
behavior for certain cases).

Anyway, those branches are at https://github.com/peff/git if anybody
wants to look further. They probably need some cleanup. At the very
least, we'd probably need to teach the whitespace-ignoring hash function
to use the same algorithm.

-Peff

[1] Actually, I added "--no-merges" to each command in p4000. It seems
    silly as it is, since the point is to compute a bunch of diffs.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/2] diff.c: get rid of duplicate implementation
  2017-10-30 17:59             ` Jeff King
@ 2017-10-30 19:07               ` Jeff King
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff King @ 2017-10-30 19:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Stefan Beller, gitster, git, jrnieder

On Mon, Oct 30, 2017 at 10:59:41AM -0700, Jeff King wrote:

> > There's also https://www.strchr.com/hash_functions, which lists DJB2
> > as Bernstein.  Both functions rank somewhere in the middle of that list.
> 
> FWIW, I did some experiments with Murmur3 and SipHash a while back, but
> I don't think I came up with anything faster than the existing code.
> OTOH, moving to SipHash gives us the ability to randomize the hashes,
> which can resist some DoS attacks (although as I've said before,
> computing arbitrary diffs for untrusted strangers is pretty much a
> DoS-in-a-box).

By the way, one of the things that complicates plugging new functions
into xdiff's hashing is that xdl_hash_record() simultaneously computes
the hash _and_ finds the end-of-line marker.

So the "siphash is only 10% slower" number I showed came with quite a
few contortions to do both. Here it is compared to a more naive
application of the siphash code (i.e., memchr to find end-of-line, and
then feed the resulting bytes to siphash):

Test                                  origin            HEAD^                    jk/xdl-siphash-wip
-------------------------------------------------------------------------------------------------------
4000.1: log -3000 (baseline)          0.05(0.05+0.00)   0.05(0.05+0.00) +0.0%    0.05(0.05+0.00) +0.0%
4000.2: log --raw -3000 (tree-only)   0.31(0.27+0.03)   0.31(0.27+0.03) +0.0%    0.31(0.27+0.03) +0.0%
4000.3: log -p -3000 (Myers)          2.06(2.01+0.05)   2.30(2.21+0.09) +11.7%   2.96(2.91+0.04) +43.7%
4000.4: log -p -3000 --histogram      2.44(2.38+0.06)   2.67(2.60+0.07) +9.4%    3.32(3.26+0.06) +36.1%
4000.5: log -p -3000 --patience       2.57(2.47+0.09)   2.90(2.82+0.08) +12.8%   3.48(3.43+0.05) +35.4%

There "origin" is the existing djb hash, "HEAD^" is the complicated
"fast" siphash (which I very well may have screwed up), and the final is
the more naive version, which is quite a bit slower.

-Peff

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-10-30 19:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-24 18:59 [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
2017-10-24 18:59 ` [PATCH 1/4] hashmap: introduce memhash_feed to access the internals of FNV-1 hash Stefan Beller
2017-10-24 20:23   ` René Scharfe
2017-10-24 20:48     ` Stefan Beller
2017-10-24 18:59 ` [PATCH 2/4] xdiff-interface: export comparing and hashing strings Stefan Beller
2017-10-24 20:23   ` René Scharfe
2017-10-24 20:42     ` Stefan Beller
2017-10-26 17:03       ` René Scharfe
2017-10-24 18:59 ` [PATCH 3/4] xdiff: use stronger hash function internally Stefan Beller
2017-10-24 20:23   ` René Scharfe
2017-10-24 20:46     ` Stefan Beller
2017-10-24 23:04   ` Jonathan Nieder
2017-10-24 18:59 ` [PATCH 4/4] diff.c: get rid of duplicate implementation Stefan Beller
2017-10-24 19:02 ` [PATCH 0/4] (x)diff cleanup: remove duplicate code Stefan Beller
2017-10-24 23:42   ` [PATCHv2 0/2] " Stefan Beller
2017-10-24 23:42     ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
2017-10-25  4:26       ` Eric Sunshine
2017-10-24 23:42     ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
2017-10-25  5:11     ` [PATCHv2 0/2] (x)diff cleanup: remove duplicate code Junio C Hamano
2017-10-25 18:47       ` Stefan Beller
2017-10-25 18:49       ` [PATCHv3 " Stefan Beller
2017-10-25 18:49         ` [PATCH 1/2] xdiff-interface: export comparing and hashing strings Stefan Beller
2017-10-26 17:12           ` René Scharfe
2017-10-27  7:12             ` Junio C Hamano
2017-10-27 17:15               ` Stefan Beller
2017-10-25 18:49         ` [PATCH 2/2] diff.c: get rid of duplicate implementation Stefan Beller
2017-10-26  2:23           ` Junio C Hamano
2017-10-26 17:43           ` René Scharfe
2017-10-30 17:59             ` Jeff King
2017-10-30 19:07               ` Jeff King
2017-10-24 23:45   ` [PATCH 1/5] fnv: migrate code from hashmap to fnv Stefan Beller

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).