git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH/RFC 0/5] New hash table implementation
@ 2013-09-10 23:27 Karsten Blees
  2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-10 23:27 UTC (permalink / raw
  To: Git List

Also here: https://github.com/kblees/git/tree/kb/hashmap

Hi,

this is a spin-off of my (very slowly progressing) msysgit fscache project. I needed to remove things from the hash table, which cannot be implemented efficiently in hash.[ch].

So I wrote hasmap.[ch], with these features:
- O(1) remove
- builtin entry chaining
- ready-to-use FNV-1 hash functions
- unit test
- additions are ~twice as fast
- uses less memory

Patches 2 and 5 convert existing uses of hash.[ch] to hashmap.[ch].
Patches 3 and 4 are useful optimizations of their own.

I haven't found the time to tackle name-hash.c yet, this is where remove() could come into play (to replace the CE_UNHASHED flag).

Karsten


Karsten Blees (5):
  add a hashtable implementation that supports O(1) removal
  buitin/describe.c: use new hash map implementation
  diffcore-rename.c: move code around to prepare for the next patch
  diffcore-rename.c: simplify finding exact renames
  diffcore-rename.c: use new hash map implementation

 Makefile           |   3 +
 builtin/describe.c |  53 +++++------
 diffcore-rename.c  | 185 +++++++++++++-------------------------
 hashmap.c          | 210 +++++++++++++++++++++++++++++++++++++++++++
 hashmap.h          | 200 +++++++++++++++++++++++++++++++++++++++++
 t/t0011-hashmap.sh | 236 ++++++++++++++++++++++++++++++++++++++++++++++++
 test-hashmap.c     | 258 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 995 insertions(+), 150 deletions(-)
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c

-- 
1.8.4.8243.gbcbdefd

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal
  2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
@ 2013-09-10 23:28 ` Karsten Blees
  2013-09-11 23:56   ` Junio C Hamano
  2013-09-12  4:10   ` Junio C Hamano
  2013-09-10 23:28 ` [PATCH/RFC 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-10 23:28 UTC (permalink / raw
  To: Git List

The existing hashtable implementation (in hash.[ch]) uses open addressing
(i.e. resolve hash collisions by distributing entries across the table).
Thus, removal is difficult to implement with less than O(n) complexity.
Resolving collisions of entries with identical hashes (e.g. via chaining)
is left to the client code.

Add a hashtable implementation that supports O(1) removal and is slightly
easier to use due to builtin entry chaining.

Supports all basic operations init, free, get, add, remove and iteration.

Also includes ready-to-use hash functions based on the public domain FNV-1
algorithm (http://www.isthe.com/chongo/tech/comp/fnv).

The per-entry data structure (hashmap_entry) is meant to be piggybacked
in front of the client's data structure to save memory. See test-hashmap.c
for usage examples.

The hashtable is resized by a factor of four when 80% full. With these
settings, average memory consumption is about 2/3 of hash.[ch], and
insertion is about twice as fast due to less frequent resizing.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Makefile           |   3 +
 hashmap.c          | 210 +++++++++++++++++++++++++++++++++++++++++++
 hashmap.h          | 200 +++++++++++++++++++++++++++++++++++++++++
 t/t0011-hashmap.sh | 236 ++++++++++++++++++++++++++++++++++++++++++++++++
 test-hashmap.c     | 258 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 907 insertions(+)
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c

diff --git a/Makefile b/Makefile
index 3588ca1..e6ad011 100644
--- a/Makefile
+++ b/Makefile
@@ -562,6 +562,7 @@ TEST_PROGRAMS_NEED_X += test-date
 TEST_PROGRAMS_NEED_X += test-delta
 TEST_PROGRAMS_NEED_X += test-dump-cache-tree
 TEST_PROGRAMS_NEED_X += test-genrandom
+TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
@@ -681,6 +682,7 @@ LIB_H += gpg-interface.h
 LIB_H += graph.h
 LIB_H += grep.h
 LIB_H += hash.h
+LIB_H += hashmap.h
 LIB_H += help.h
 LIB_H += http.h
 LIB_H += kwset.h
@@ -811,6 +813,7 @@ LIB_OBJS += gpg-interface.o
 LIB_OBJS += graph.o
 LIB_OBJS += grep.o
 LIB_OBJS += hash.o
+LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
diff --git a/hashmap.c b/hashmap.c
new file mode 100644
index 0000000..686ee6f
--- /dev/null
+++ b/hashmap.c
@@ -0,0 +1,210 @@
+/*
+ * Generic implementation of hash-based key value mappings.
+ */
+#include "cache.h"
+#include "hashmap.h"
+
+#define FNV32_BASE ((unsigned int) 0x811c9dc5)
+#define FNV32_PRIME ((unsigned int) 0x01000193)
+
+unsigned int strhash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++))
+		hash = (hash * FNV32_PRIME) ^ c;
+	return hash;
+}
+
+unsigned int strihash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++)) {
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memhash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char*) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memihash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char*) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+#define HASHMAP_INITIAL_SIZE 64
+/* grow / shrink by 2^2 */
+#define HASHMAP_GROW 2
+/* grow if > 80% full (to 20%) */
+#define HASHMAP_GROW_AT 1.25
+/* shrink if < 16.6% full (to 66.6%) */
+#define HASHMAP_SHRINK_AT 6
+
+static inline int entry_equals(const hashmap *map, const hashmap_entry *e1,
+		const hashmap_entry *e2)
+{
+	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2));
+}
+
+static inline unsigned int bucket(const hashmap *map, const hashmap_entry *key)
+{
+	return key->hash & (map->tablesize - 1);
+}
+
+static void rehash(hashmap *map, unsigned int newsize)
+{
+	unsigned int i, oldsize = map->tablesize;
+	hashmap_entry **oldtable = map->table;
+
+	map->tablesize = newsize;
+	map->table = xcalloc(sizeof(hashmap_entry*), map->tablesize);
+	for (i = 0; i < oldsize; i++) {
+		hashmap_entry *e = oldtable[i];
+		while (e) {
+			hashmap_entry *next = e->next;
+			unsigned int b = bucket(map, e);
+			e->next = map->table[b];
+			map->table[b] = e;
+			e = next;
+		}
+	}
+	free(oldtable);
+}
+
+static inline hashmap_entry **find_entry(const hashmap *map,
+		const hashmap_entry *key)
+{
+	hashmap_entry **e = &map->table[bucket(map, key)];
+	while (*e && !entry_equals(map, *e, key))
+		e = &(*e)->next;
+	return e;
+}
+
+static int always_equal(const void *unused1, const void *unused2)
+{
+	return 0;
+}
+
+void hashmap_init(hashmap *map, hashmap_cmp_fn equals_function,
+		size_t initial_size)
+{
+	map->size = 0;
+	map->cmpfn = equals_function ? equals_function : always_equal;
+	/* calculate initial table size and allocate the table */
+	map->tablesize = HASHMAP_INITIAL_SIZE;
+	initial_size *= HASHMAP_GROW_AT;
+	while (initial_size > map->tablesize)
+		map->tablesize <<= HASHMAP_GROW;
+	map->table = xcalloc(sizeof(hashmap_entry*), map->tablesize);
+}
+
+void hashmap_free(hashmap *map, hashmap_free_fn free_function)
+{
+	if (!map || !map->table)
+		return;
+	if (free_function) {
+		hashmap_iter iter;
+		hashmap_entry *e;
+		hashmap_iter_init(map, &iter);
+		while ((e = hashmap_iter_next(&iter)))
+			(*free_function)(e);
+	}
+	free(map->table);
+	memset(map, 0, sizeof(*map));
+}
+
+void *hashmap_get(const hashmap *map, const void *key)
+{
+	return *find_entry(map, key);
+}
+
+void *hashmap_get_next(const hashmap *map, const void *entry)
+{
+	hashmap_entry *e = ((hashmap_entry*) entry)->next;
+	for (; e; e = e->next)
+		if (entry_equals(map, entry, e))
+			return e;
+	return NULL;
+}
+
+void hashmap_add(hashmap *map, void *entry)
+{
+	unsigned int b = bucket(map, entry);
+
+	/* add entry */
+	((hashmap_entry*) entry)->next = map->table[b];
+	map->table[b] = entry;
+
+	/* fix size and rehash if appropriate */
+	map->size++;
+	if (map->size * HASHMAP_GROW_AT > map->tablesize)
+		rehash(map, map->tablesize << HASHMAP_GROW);
+}
+
+void *hashmap_remove(hashmap *map, const void *key)
+{
+	hashmap_entry *old;
+	hashmap_entry **e = find_entry(map, key);
+	if (!*e)
+		return NULL;
+
+	/* remove existing entry */
+	old = *e;
+	*e = old->next;
+	old->next = NULL;
+
+	/* fix size and rehash if appropriate */
+	map->size--;
+	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
+		map->size * HASHMAP_SHRINK_AT < map->tablesize)
+		rehash(map, map->tablesize >> HASHMAP_GROW);
+	return old;
+}
+
+void *hashmap_put(hashmap *map, void *entry)
+{
+	hashmap_entry *old = hashmap_remove(map, entry);
+	hashmap_add(map, entry);
+	return old;
+}
+
+void hashmap_iter_init(hashmap *map, hashmap_iter *iter)
+{
+	iter->map = map;
+	iter->tablepos = 0;
+	iter->next = NULL;
+}
+
+void *hashmap_iter_next(hashmap_iter *iter)
+{
+	hashmap_entry *current = iter->next;
+	for (;;) {
+		if (current) {
+			iter->next = current->next;
+			return current;
+		}
+
+		if (iter->tablepos >= iter->map->tablesize)
+			return NULL;
+
+		current = iter->map->table[iter->tablepos++];
+	}
+}
diff --git a/hashmap.h b/hashmap.h
new file mode 100644
index 0000000..59f8489
--- /dev/null
+++ b/hashmap.h
@@ -0,0 +1,200 @@
+#ifndef HASHMAP_H
+#define HASHMAP_H
+
+/*
+ * Generic implementation of hash-based key value mappings.
+ * Supports basic operations get, add/put, remove and iteration.
+ *
+ * Also contains a set of ready-to-use hash functions for strings, using the
+ * FNV-1 algorithm (see http://www.isthe.com/chongo/tech/comp/fnv).
+ */
+
+/*
+ * Case-sensitive FNV-1 hash of 0-terminated string.
+ * str: the string
+ * returns hash code
+ */
+extern unsigned int strhash(const char *buf);
+
+/*
+ * Case-insensitive FNV-1 hash of 0-terminated string.
+ * str: the string
+ * returns hash code
+ */
+extern unsigned int strihash(const char *buf);
+
+/*
+ * Case-sensitive FNV-1 hash of a memory block.
+ * buf: start of the memory block
+ * len: length of the memory block
+ * returns hash code
+ */
+extern unsigned int memhash(const void *buf, size_t len);
+
+/*
+ * Case-insensitive FNV-1 hash of a memory block.
+ * buf: start of the memory block
+ * len: length of the memory block
+ * returns hash code
+ */
+extern unsigned int memihash(const void *buf, size_t len);
+
+/*
+ * Hashmap entry data structure, intended to be used as first member of user
+ * data structures. Consists of a pointer and an int. Ideally it should be
+ * followed by an int-sized member to prevent unused memory on 64-bit systems
+ * due to alignment.
+ */
+typedef struct hashmap_entry {
+	struct hashmap_entry *next;
+	unsigned int hash;
+} hashmap_entry;
+
+/*
+ * User-supplied function to test two hashmap entries for equality, shall
+ * return 0 if the entries are equal. This function is always called with
+ * non-NULL parameters that have the same hash code. When looking up an entry,
+ * the key parameter to hashmap_get and hashmap_remove is always passed as
+ * second argument.
+ */
+typedef int (*hashmap_cmp_fn)(const void *entry, const void *entry_or_key);
+
+/*
+ * User-supplied function to free a hashmap entry.
+ */
+typedef void (*hashmap_free_fn)(void *entry);
+
+/*
+ * Hashmap data structure, use with hashmap_* functions.
+ */
+typedef struct hashmap {
+	hashmap_entry **table;
+	hashmap_cmp_fn cmpfn;
+	unsigned int size, tablesize;
+} hashmap;
+
+/*
+ * Hashmap iterator data structure, use with hasmap_iter_* functions.
+ */
+typedef struct hashmap_iter {
+	hashmap *map;
+	hashmap_entry *next;
+	unsigned int tablepos;
+} hashmap_iter;
+
+/*
+ * Initializes a hashmap_entry structure.
+ * entry: pointer to the entry to initialize
+ * hash: hash code of the entry
+ * key_only: true if entry is a key-only structure, see hashmap_entry_is_key
+ */
+static inline void hashmap_entry_init(void *entry, int hash, int key_only)
+{
+	hashmap_entry *e = entry;
+	e->hash = hash;
+	e->next = key_only ? (hashmap_entry*) -1 : NULL;
+}
+
+/*
+ * Checks if hashmap_entry was initialized with the key_only flag. This is
+ * useful if the entry structure is variable-sized (e.g. ending in a FLEX_ARRAY)
+ * and the key is part of the variable portion. To prevent dynamic allocation of
+ * a full-fledged entry structure for each lookup, a smaller, statically sized
+ * structure can be used as key (i.e. replacing the FLEX_ARRAY member with a
+ * char pointer). The hashmap_cmp_fn comparison function can then check whether
+ * entry_or_key is a full-fledged entry or a key-only structure.
+ * entry: pointer to the entry to check
+ * returns 0 for key-value entries and non-0 for key-only entries
+ */
+static inline int hashmap_entry_is_key(const void *entry)
+{
+	const hashmap_entry *e = entry;
+	return e->next == (hashmap_entry*) -1;
+}
+
+/*
+ * Initializes a hashmap structure.
+ * map: hashmap to initialize
+ * equals_function: optional function to test equality of hashmap entries. If
+ *  NULL, entries are considered equal if their hash codes are equal.
+ * initial_size: optional number of initial entries, 0 if unknown
+ */
+extern void hashmap_init(hashmap *map, hashmap_cmp_fn equals_function,
+		size_t initial_size);
+
+/*
+ * Frees a hashmap structure and allocated memory.
+ * map: hashmap to free
+ * free_function: optional function to free the hashmap entries
+ */
+extern void hashmap_free(hashmap *map, hashmap_free_fn free_function);
+
+/*
+ * Returns the hashmap entry for the specified key, or NULL if not found.
+ * map: the hashmap
+ * key: key of the entry to look up
+ * returns matching hashmap entry, or NULL if not found
+ */
+extern void *hashmap_get(const hashmap *map, const void *key);
+
+/*
+ * Returns the next equal hashmap entry if the map contains duplicates (see
+ * hashmap_add).
+ * map: the hashmap
+ * entry: current entry, obtained via hashmap_get or hashmap_get_next
+ * returns next equal hashmap entry, or NULL if not found
+ */
+extern void *hashmap_get_next(const hashmap *map, const void *entry);
+
+/*
+ * Adds a hashmap entry. This allows to add duplicate entries (i.e. separate
+ * values with the same key according to hashmap_cmp_fn).
+ * map: the hashmap
+ * entry: the entry to add
+ */
+extern void hashmap_add(hashmap *map, void *entry);
+
+/*
+ * Adds or replaces a hashmap entry.
+ * map: the hashmap
+ * entry: the entry to add or replace
+ * returns previous entry, or NULL if the entry is new
+ */
+extern void *hashmap_put(hashmap *map, void *entry);
+
+/*
+ * Removes a hashmap entry matching the specified key.
+ * map: the hashmap
+ * key: key of the entry to remove
+ * returns removed entry, or NULL if not found
+ */
+extern void *hashmap_remove(hashmap *map, const void *key);
+
+/*
+ * Initializes a hashmap iterator structure.
+ * map: the hashmap
+ * iter: hashmap iterator structure
+ */
+extern void hashmap_iter_init(hashmap *map, hashmap_iter *iter);
+
+/**
+ * Returns the next hashmap entry.
+ * iter: hashmap iterator
+ * returns next entry, or NULL if there are no more entries
+ */
+extern void *hashmap_iter_next(hashmap_iter *iter);
+
+/**
+ * Initializes a hashmap iterator and returns the first hashmap entry.
+ * map: the hashmap
+ * iter: hashmap iterator
+ * returns first entry, or NULL if there are no entries
+ */
+static inline void *hashmap_iter_first(hashmap *map,
+		hashmap_iter *iter)
+{
+	hashmap_iter_init(map, iter);
+	return hashmap_iter_next(iter);
+}
+
+#endif
diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
new file mode 100755
index 0000000..6c699d5
--- /dev/null
+++ b/t/t0011-hashmap.sh
@@ -0,0 +1,236 @@
+#!/bin/sh
+
+test_description='test hashmap and string hash functions'
+. ./test-lib.sh
+
+test_hashmap() {
+	echo "$1" | test-hashmap $3 > actual &&
+	echo "$2" > expect &&
+	test_cmp expect actual
+}
+
+test_expect_success 'hash functions' '
+
+test_hashmap "hash key1" "2215982743 2215982743 116372151 116372151" &&
+test_hashmap "hash key2" "2215982740 2215982740 116372148 116372148" &&
+test_hashmap "hash fooBarFrotz" "1383912807 1383912807 3189766727 3189766727" &&
+test_hashmap "hash foobarfrotz" "2862305959 2862305959 3189766727 3189766727"
+
+'
+
+test_expect_success 'put' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+size" "NULL
+NULL
+NULL
+NULL
+64 4"
+
+'
+
+test_expect_success 'put (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+size" "NULL
+NULL
+NULL
+64 3" ignorecase
+
+'
+
+test_expect_success 'replace' '
+
+test_hashmap "put key1 value1
+put key1 value2
+put fooBarFrotz value3
+put fooBarFrotz value4
+size" "NULL
+value1
+NULL
+value3
+64 2"
+
+'
+
+test_expect_success 'replace (case insensitive)' '
+
+test_hashmap "put key1 value1
+put Key1 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+size" "NULL
+value1
+NULL
+value3
+64 2" ignorecase
+
+'
+
+test_expect_success 'get' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+get key1
+get key2
+get fooBarFrotz
+get notInMap" "NULL
+NULL
+NULL
+NULL
+value1
+value2
+value3
+NULL"
+
+'
+
+test_expect_success 'get (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+get Key1
+get keY2
+get foobarfrotz
+get notInMap" "NULL
+NULL
+NULL
+value1
+value2
+value3
+NULL" ignorecase
+
+'
+
+test_expect_success 'add' '
+
+test_hashmap "add key1 value1
+add key1 value2
+add fooBarFrotz value3
+add fooBarFrotz value4
+get key1
+get fooBarFrotz
+get notInMap" "value2
+value1
+value4
+value3
+NULL"
+
+'
+
+test_expect_success 'add (case insensitive)' '
+
+test_hashmap "add key1 value1
+add Key1 value2
+add fooBarFrotz value3
+add foobarfrotz value4
+get key1
+get Foobarfrotz
+get notInMap" "value2
+value1
+value4
+value3
+NULL" ignorecase
+
+'
+
+test_expect_success 'remove' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+remove key1
+remove key2
+remove notInMap
+size" "NULL
+NULL
+NULL
+value1
+value2
+NULL
+64 1"
+
+'
+
+test_expect_success 'remove (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+remove Key1
+remove keY2
+remove notInMap
+size" "NULL
+NULL
+NULL
+value1
+value2
+NULL
+64 1" ignorecase
+
+'
+
+test_expect_success 'iterate' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+iterate" "NULL
+NULL
+NULL
+key2 value2
+key1 value1
+fooBarFrotz value3"
+
+'
+
+test_expect_success 'iterate (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+iterate" "NULL
+NULL
+NULL
+fooBarFrotz value3
+key2 value2
+key1 value1" ignorecase
+
+'
+
+test_expect_success 'grow / shrink' '
+
+	rm -f in &&
+	rm -f expect &&
+	for n in $(test_seq 51)
+	do
+		echo put key$n value$n >> in &&
+		echo NULL >> expect
+	done &&
+	echo size >> in &&
+	echo 64 51 >> expect &&
+	echo put key52 value52 >> in &&
+	echo NULL >> expect
+	echo size >> in &&
+	echo 256 52 >> expect &&
+	for n in $(test_seq 10)
+	do
+		echo remove key$n >> in &&
+		echo value$n >> expect
+	done &&
+	echo size >> in &&
+	echo 64 42 >> expect &&
+	cat in | test-hashmap > out &&
+	test_cmp expect out
+
+'
+
+test_done
diff --git a/test-hashmap.c b/test-hashmap.c
new file mode 100644
index 0000000..a4b3440
--- /dev/null
+++ b/test-hashmap.c
@@ -0,0 +1,258 @@
+#include "cache.h"
+#include "hashmap.h"
+#include <stdio.h>
+
+typedef struct test_entry
+{
+	hashmap_entry ent;
+	/* key and value as two \0-terminated strings */
+	char key[FLEX_ARRAY];
+} test_entry;
+
+typedef struct test_key
+{
+	hashmap_entry ent;
+	char *key;
+} test_key;
+
+static const char *get_key(const test_entry *e)
+{
+	return hashmap_entry_is_key(e) ? ((test_key*) e)->key : e->key;
+}
+
+static const char *get_value(const test_entry *e)
+{
+	return e->key + strlen(e->key) + 1;
+}
+
+static int test_entry_cmp(const test_entry *e1, const test_entry *e2)
+{
+	return strcmp(e1->key, get_key(e2));
+}
+
+static int test_entry_cmp_icase(const test_entry *e1, const test_entry *e2)
+{
+	return strcasecmp(e1->key, get_key(e2));
+}
+
+static test_entry *alloc_test_entry(int hash, char *key, int klen, char *value,
+		int vlen)
+{
+	test_entry *entry = malloc(sizeof(test_entry) + klen + vlen + 2);
+	hashmap_entry_init(entry, hash, 0);
+	memcpy(entry->key, key, klen + 1);
+	memcpy(entry->key + klen + 1, value, vlen + 1);
+	return entry;
+}
+
+/*
+ * Test insert performance of hashmap.[ch]
+ * Usage: time echo "perfhashmap size rounds" | test-hashmap
+ */
+static void perf_hashmap(unsigned int size, unsigned int rounds)
+{
+	hashmap map;
+	char buf[16];
+	test_entry **entries;
+	unsigned int i, j;
+
+	entries = malloc(size * sizeof(test_entry*));
+	for (i = 0; i < size; i++) {
+		snprintf(buf, sizeof(buf), "%i", i);
+		entries[i] = alloc_test_entry(0, buf, strlen(buf), "", 0);
+	}
+
+	for (j = 0; j < rounds; j++) {
+		// initialize the map
+		hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
+
+		// add entries
+		for (i = 0; i < size; i++) {
+			unsigned int hash = strhash(entries[i]->key);
+			hashmap_entry_init(entries[i], hash, 0);
+			hashmap_add(&map, entries[i]);
+		}
+
+		hashmap_free(&map, NULL);
+	}
+}
+
+typedef struct hash_entry
+{
+	struct hash_entry *next;
+	char key[FLEX_ARRAY];
+} hash_entry;
+
+/*
+ * Test insert performance of hash.[ch]
+ * Usage: time echo "perfhashtable size rounds" | test-hashmap
+ */
+static void perf_hashtable(unsigned int size, unsigned int rounds)
+{
+	struct hash_table map;
+	char buf[16];
+	hash_entry **entries, **res;
+	unsigned int i, j;
+
+	entries = malloc(size * sizeof(hash_entry*));
+	for (i = 0; i < size; i++) {
+		snprintf(buf, sizeof(buf), "%i", i);
+		entries[i] = malloc(sizeof(hash_entry) + strlen(buf) + 1);
+		strcpy(entries[i]->key, buf);
+	}
+
+	for (j = 0; j < rounds; j++) {
+		// initialize the map
+		init_hash(&map);
+
+		// add entries
+		for (i = 0; i < size; i++) {
+			unsigned int hash = strhash(entries[i]->key);
+			res = (hash_entry**) insert_hash(hash, entries[i], &map);
+			if (res) {
+				entries[i]->next = *res;
+				*res = entries[i];
+			} else {
+				entries[i]->next = NULL;
+			}
+		}
+
+		free_hash(&map);
+	}
+}
+
+#define DELIM " \t\r\n"
+
+/*
+ * Read stdin line by line and print result of commands to stdout:
+ *
+ * hash key -> strhash(key) memhash(key) strihash(key) memihash(key)
+ * put key value -> NULL / old value
+ * get key -> NULL / value
+ * remove key -> NULL / old value
+ * iterate -> key1 value1\nkey2 value2\n...
+ * size -> tablesize numentries
+ *
+ * perfhashmap size rounds -> hashmap.[ch]: add <size> entries <rounds> times
+ * perfhashtable size rounds -> hash.[ch]: add <size> entries <rounds> times
+ */
+int main(int argc, char *argv[])
+{
+	char line[1024];
+	hashmap map;
+	int icase;
+
+	/* init hash map */
+	icase = argc > 1 && !strcmp("ignorecase", argv[1]);
+	hashmap_init(&map, (hashmap_cmp_fn) (icase ? test_entry_cmp_icase
+			: test_entry_cmp), 0);
+
+	/* process commands from stdin */
+	while (fgets(line, sizeof(line), stdin)) {
+		char *cmd, *p1 = NULL, *p2 = NULL;
+		int l1 = 0, l2 = 0, hash = 0;
+		test_entry *entry;
+
+		/* break line into command and up to two parameters */
+		cmd = strtok(line, DELIM);
+		/* ignore empty lines */
+		if (!cmd || *cmd == '#')
+			continue;
+
+		p1 = strtok(NULL, DELIM);
+		if (p1) {
+			l1 = strlen(p1);
+			hash = icase ? strihash(p1) : strhash(p1);
+			p2 = strtok(NULL, DELIM);
+			if (p2)
+				l2 = strlen(p2);
+		}
+
+		if (!strcmp("hash", cmd) && l1) {
+
+			/* print results of different hash functions */
+			printf("%u %u %u %u\n", strhash(p1), memhash(p1, l1),
+					strihash(p1), memihash(p1, l1));
+
+		} else if (!strcmp("add", cmd) && l1 && l2) {
+
+			/* create entry with key = p1, value = p2 */
+			entry = alloc_test_entry(hash, p1, l1, p2, l2);
+
+			/* add to hashmap */
+			hashmap_add(&map, entry);
+
+		} else if (!strcmp("put", cmd) && l1 && l2) {
+
+			/* create entry with key = p1, value = p2 */
+			entry = alloc_test_entry(hash, p1, l1, p2, l2);
+
+			/* add / replace entry */
+			entry = hashmap_put(&map, entry);
+
+			/* print and free replaced entry, if any */
+			puts(entry ? get_value(entry) : "NULL");
+			free(entry);
+
+		} else if (!strcmp("get", cmd) && l1) {
+
+			/* setup static key */
+			test_key key;
+			hashmap_entry_init(&key, hash, 1);
+			key.key = p1;
+
+			/* lookup entry in hashmap */
+			entry = hashmap_get(&map, &key);
+
+			/* print result */
+			if (!entry)
+				puts("NULL");
+			while (entry) {
+				puts(get_value(entry));
+				entry = hashmap_get_next(&map, entry);
+			}
+
+		} else if (!strcmp("remove", cmd) && l1) {
+
+			/* setup static key */
+			test_key key;
+			hashmap_entry_init(&key, hash, 1);
+			key.key = p1;
+
+			/* remove entry from hashmap */
+			entry = hashmap_remove(&map, &key);
+
+			/* print result and free entry*/
+			puts(entry ? get_value(entry) : "NULL");
+			free(entry);
+
+		} else if (!strcmp("iterate", cmd)) {
+
+			hashmap_iter iter;
+			hashmap_iter_init(&map, &iter);
+			while ((entry = hashmap_iter_next(&iter)))
+				printf("%s %s\n", get_key(entry), get_value(entry));
+
+		} else if (!strcmp("size", cmd)) {
+
+			/* print table sizes */
+			printf("%u %u\n", map.tablesize, map.size);
+
+		} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
+
+			perf_hashmap(atoi(p1), atoi(p2));
+
+		} else if (!strcmp("perfhashtable", cmd) && l1 && l2) {
+
+			perf_hashtable(atoi(p1), atoi(p2));
+
+		} else {
+
+			printf("Unknown command %s\n", cmd);
+
+		}
+	}
+
+	hashmap_free(&map, free);
+	return 0;
+}
-- 
1.8.4.8243.gbcbdefd

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH/RFC 2/5] buitin/describe.c: use new hash map implementation
  2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
  2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
@ 2013-09-10 23:28 ` Karsten Blees
  2013-09-10 23:29 ` [PATCH/RFC 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-10 23:28 UTC (permalink / raw
  To: Git List

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 builtin/describe.c | 53 ++++++++++++++++++++++++-----------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 7d73722..bbc7159 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -6,7 +6,7 @@
 #include "exec_cmd.h"
 #include "parse-options.h"
 #include "diff.h"
-#include "hash.h"
+#include "hashmap.h"
 #include "argv-array.h"
 
 #define SEEN		(1u<<0)
@@ -25,7 +25,7 @@ static int longformat;
 static int first_parent;
 static int abbrev = -1; /* unspecified */
 static int max_candidates = 10;
-static struct hash_table names;
+static hashmap names;
 static int have_util;
 static const char *pattern;
 static int always;
@@ -38,7 +38,7 @@ static const char *diff_index_args[] = {
 
 
 struct commit_name {
-	struct commit_name *next;
+	hashmap_entry entry;
 	unsigned char peeled[20];
 	struct tag *tag;
 	unsigned prio:2; /* annotated tag = 2, tag = 1, head = 0 */
@@ -50,6 +50,11 @@ static const char *prio_names[] = {
 	"head", "lightweight", "annotated",
 };
 
+static int commit_name_cmp(struct commit_name *cn1, struct commit_name *cn2)
+{
+	return hashcmp(cn1->peeled, cn2->peeled);
+}
+
 static inline unsigned int hash_sha1(const unsigned char *sha1)
 {
 	unsigned int hash;
@@ -59,21 +64,10 @@ static inline unsigned int hash_sha1(const unsigned char *sha1)
 
 static inline struct commit_name *find_commit_name(const unsigned char *peeled)
 {
-	struct commit_name *n = lookup_hash(hash_sha1(peeled), &names);
-	while (n && !!hashcmp(peeled, n->peeled))
-		n = n->next;
-	return n;
-}
-
-static int set_util(void *chain, void *data)
-{
-	struct commit_name *n;
-	for (n = chain; n; n = n->next) {
-		struct commit *c = lookup_commit_reference_gently(n->peeled, 1);
-		if (c)
-			c->util = n;
-	}
-	return 0;
+	struct commit_name key;
+	hashmap_entry_init(&key, hash_sha1(peeled), 0);
+	hashcpy(key.peeled, peeled);
+	return hashmap_get(&names, &key);
 }
 
 static int replace_name(struct commit_name *e,
@@ -118,16 +112,10 @@ static void add_to_known_names(const char *path,
 	struct tag *tag = NULL;
 	if (replace_name(e, prio, sha1, &tag)) {
 		if (!e) {
-			void **pos;
 			e = xmalloc(sizeof(struct commit_name));
 			hashcpy(e->peeled, peeled);
-			pos = insert_hash(hash_sha1(peeled), e, &names);
-			if (pos) {
-				e->next = *pos;
-				*pos = e;
-			} else {
-				e->next = NULL;
-			}
+			hashmap_entry_init(e, hash_sha1(peeled), 0);
+			hashmap_add(&names, e);
 			e->path = NULL;
 		}
 		e->tag = tag;
@@ -292,7 +280,14 @@ static void describe(const char *arg, int last_one)
 		fprintf(stderr, _("searching to describe %s\n"), arg);
 
 	if (!have_util) {
-		for_each_hash(&names, set_util, NULL);
+		hashmap_iter iter;
+		struct commit *c;
+		struct commit_name *n = hashmap_iter_first(&names, &iter);
+		for (; n; n = hashmap_iter_next(&iter)) {
+			c = lookup_commit_reference_gently(n->peeled, 1);
+			if (c)
+				c->util = n;
+		}
 		have_util = 1;
 	}
 
@@ -463,9 +458,9 @@ int cmd_describe(int argc, const char **argv, const char *prefix)
 		return cmd_name_rev(args.argc, args.argv, prefix);
 	}
 
-	init_hash(&names);
+	hashmap_init(&names, (hashmap_cmp_fn) commit_name_cmp, 0);
 	for_each_rawref(get_name, NULL);
-	if (!names.nr && !always)
+	if (!names.size && !always)
 		die(_("No names found, cannot describe anything."));
 
 	if (argc == 0) {
-- 
1.8.4.8243.gbcbdefd

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH/RFC 3/5] diffcore-rename.c: move code around to prepare for the next patch
  2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
  2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
  2013-09-10 23:28 ` [PATCH/RFC 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
@ 2013-09-10 23:29 ` Karsten Blees
  2013-09-10 23:30 ` [PATCH/RFC 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-10 23:29 UTC (permalink / raw
  To: Git List

No actual code changes, just move hash_filespec up and outdent part of
find_identical_files.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 diffcore-rename.c | 98 +++++++++++++++++++++++++++----------------------------
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6c7a72f..008a60c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -248,6 +248,18 @@ struct file_similarity {
 	struct file_similarity *next;
 };
 
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
 static int find_identical_files(struct file_similarity *src,
 				struct file_similarity *dst,
 				struct diff_options *options)
@@ -258,46 +270,46 @@ static int find_identical_files(struct file_similarity *src,
 	 * Walk over all the destinations ...
 	 */
 	do {
-		struct diff_filespec *target = dst->filespec;
-		struct file_similarity *p, *best;
-		int i = 100, best_score = -1;
-
-		/*
-		 * .. to find the best source match
-		 */
-		best = NULL;
-		for (p = src; p; p = p->next) {
-			int score;
-			struct diff_filespec *source = p->filespec;
-
-			/* False hash collision? */
-			if (hashcmp(source->sha1, target->sha1))
-				continue;
-			/* Non-regular files? If so, the modes must match! */
-			if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
-				if (source->mode != target->mode)
-					continue;
-			}
-			/* Give higher scores to sources that haven't been used already */
-			score = !source->rename_used;
-			if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
-				continue;
-			score += basename_same(source, target);
-			if (score > best_score) {
-				best = p;
-				best_score = score;
-				if (score == 2)
-					break;
-			}
+	struct diff_filespec *target = dst->filespec;
+	struct file_similarity *p, *best;
+	int i = 100, best_score = -1;
 
-			/* Too many identical alternatives? Pick one */
-			if (!--i)
-				break;
+	/*
+	 * .. to find the best source match
+	 */
+	best = NULL;
+	for (p = src; p; p = p->next) {
+		int score;
+		struct diff_filespec *source = p->filespec;
+
+		/* False hash collision? */
+		if (hashcmp(source->sha1, target->sha1))
+			continue;
+		/* Non-regular files? If so, the modes must match! */
+		if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
+			if (source->mode != target->mode)
+				continue;
 		}
-		if (best) {
-			record_rename_pair(dst->index, best->index, MAX_SCORE);
-			renames++;
+		/* Give higher scores to sources that haven't been used already */
+		score = !source->rename_used;
+		if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
+			continue;
+		score += basename_same(source, target);
+		if (score > best_score) {
+			best = p;
+			best_score = score;
+			if (score == 2)
+				break;
 		}
+
+		/* Too many identical alternatives? Pick one */
+		if (!--i)
+			break;
+	}
+	if (best) {
+		record_rename_pair(dst->index, best->index, MAX_SCORE);
+		renames++;
+	}
 	} while ((dst = dst->next) != NULL);
 	return renames;
 }
@@ -343,18 +355,6 @@ static int find_same_files(void *ptr, void *data)
 	return ret;
 }
 
-static unsigned int hash_filespec(struct diff_filespec *filespec)
-{
-	unsigned int hash;
-	if (!filespec->sha1_valid) {
-		if (diff_populate_filespec(filespec, 0))
-			return 0;
-		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
-	}
-	memcpy(&hash, filespec->sha1, sizeof(hash));
-	return hash;
-}
-
 static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
 {
 	void **pos;
-- 
1.8.4.8243.gbcbdefd

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH/RFC 4/5] diffcore-rename.c: simplify finding exact renames
  2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
                   ` (2 preceding siblings ...)
  2013-09-10 23:29 ` [PATCH/RFC 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
@ 2013-09-10 23:30 ` Karsten Blees
  2013-09-10 23:30 ` [PATCH/RFC 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
  5 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-10 23:30 UTC (permalink / raw
  To: Git List

The find_exact_renames function currently only uses the hash table for
grouping, i.e.:

1. add sources
2. add destinations
3. iterate all buckets, per bucket:
4. split sources from destinations
5. iterate destinations, per destination:
6. iterate sources to find best match

This can be simplified by utilizing the lookup functionality of the hash
table, i.e.:

1. add sources
2. iterate destinations, per destination:
3. lookup sources matching the current destination
4. iterate sources to find best match

This saves several iterations and file_similarity allocations for the
destinations.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 diffcore-rename.c | 75 +++++++++++++++----------------------------------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 008a60c..82b7975 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -243,7 +243,7 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
-	int src_dst, index;
+	int index;
 	struct diff_filespec *filespec;
 	struct file_similarity *next;
 };
@@ -260,25 +260,21 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(struct file_similarity *src,
-				struct file_similarity *dst,
+static int find_identical_files(struct hash_table *srcs,
+				int dst_index,
 				struct diff_options *options)
 {
 	int renames = 0;
 
-	/*
-	 * Walk over all the destinations ...
-	 */
-	do {
-	struct diff_filespec *target = dst->filespec;
+	struct diff_filespec *target = rename_dst[dst_index].two;
 	struct file_similarity *p, *best;
 	int i = 100, best_score = -1;
 
 	/*
-	 * .. to find the best source match
+	 * Find the best source match for specified destination.
 	 */
 	best = NULL;
-	for (p = src; p; p = p->next) {
+	for (p = lookup_hash(hash_filespec(target), srcs); p; p = p->next) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
@@ -307,61 +303,28 @@ static int find_identical_files(struct file_similarity *src,
 			break;
 	}
 	if (best) {
-		record_rename_pair(dst->index, best->index, MAX_SCORE);
+		record_rename_pair(dst_index, best->index, MAX_SCORE);
 		renames++;
 	}
-	} while ((dst = dst->next) != NULL);
 	return renames;
 }
 
-static void free_similarity_list(struct file_similarity *p)
+static int free_similarity_list(void *p, void *unused)
 {
 	while (p) {
 		struct file_similarity *entry = p;
-		p = p->next;
+		p = entry->next;
 		free(entry);
 	}
+	return 0;
 }
 
-static int find_same_files(void *ptr, void *data)
-{
-	int ret;
-	struct file_similarity *p = ptr;
-	struct file_similarity *src = NULL, *dst = NULL;
-	struct diff_options *options = data;
-
-	/* Split the hash list up into sources and destinations */
-	do {
-		struct file_similarity *entry = p;
-		p = p->next;
-		if (entry->src_dst < 0) {
-			entry->next = src;
-			src = entry;
-		} else {
-			entry->next = dst;
-			dst = entry;
-		}
-	} while (p);
-
-	/*
-	 * If we have both sources *and* destinations, see if
-	 * we can match them up
-	 */
-	ret = (src && dst) ? find_identical_files(src, dst, options) : 0;
-
-	/* Free the hashes and return the number of renames found */
-	free_similarity_list(src);
-	free_similarity_list(dst);
-	return ret;
-}
-
-static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+static void insert_file_table(struct hash_table *table, int index, struct diff_filespec *filespec)
 {
 	void **pos;
 	unsigned int hash;
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
-	entry->src_dst = src_dst;
 	entry->index = index;
 	entry->filespec = filespec;
 	entry->next = NULL;
@@ -385,24 +348,26 @@ static void insert_file_table(struct hash_table *table, int src_dst, int index,
  */
 static int find_exact_renames(struct diff_options *options)
 {
-	int i;
+	int i, renames;
 	struct hash_table file_table;
 
+	/* Add all sources to the hash table */
 	init_hash(&file_table);
-	preallocate_hash(&file_table, rename_src_nr + rename_dst_nr);
+	preallocate_hash(&file_table, rename_src_nr);
 	for (i = 0; i < rename_src_nr; i++)
-		insert_file_table(&file_table, -1, i, rename_src[i].p->one);
+		insert_file_table(&file_table, i, rename_src[i].p->one);
 
+	/* Walk the destinations and find best source match */
 	for (i = 0; i < rename_dst_nr; i++)
-		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+		renames += find_identical_files(&file_table, i, options);
 
-	/* Find the renames */
-	i = for_each_hash(&file_table, find_same_files, options);
+	/* Free source file_similarity chains */
+	for_each_hash(&file_table, free_similarity_list, options);
 
 	/* .. and free the hash data structure */
 	free_hash(&file_table);
 
-	return i;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
1.8.4.8243.gbcbdefd

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH/RFC 5/5] diffcore-rename.c: use new hash map implementation
  2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
                   ` (3 preceding siblings ...)
  2013-09-10 23:30 ` [PATCH/RFC 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
@ 2013-09-10 23:30 ` Karsten Blees
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
  5 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-10 23:30 UTC (permalink / raw
  To: Git List

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 diffcore-rename.c | 48 +++++++++++++-----------------------------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 82b7975..6271af9 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,7 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
-#include "hash.h"
+#include "hashmap.h"
 #include "progress.h"
 
 /* Table of rename/copy destinations */
@@ -243,9 +243,9 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
+	hashmap_entry entry;
 	int index;
 	struct diff_filespec *filespec;
-	struct file_similarity *next;
 };
 
 static unsigned int hash_filespec(struct diff_filespec *filespec)
@@ -260,21 +260,22 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(struct hash_table *srcs,
+static int find_identical_files(hashmap *srcs,
 				int dst_index,
 				struct diff_options *options)
 {
 	int renames = 0;
 
 	struct diff_filespec *target = rename_dst[dst_index].two;
-	struct file_similarity *p, *best;
+	struct file_similarity *p, *best, dst;
 	int i = 100, best_score = -1;
 
 	/*
 	 * Find the best source match for specified destination.
 	 */
 	best = NULL;
-	for (p = lookup_hash(hash_filespec(target), srcs); p; p = p->next) {
+	hashmap_entry_init(&dst, hash_filespec(target), 0);
+	for (p = hashmap_get(srcs, &dst); p; p = hashmap_get_next(srcs, p)) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
@@ -309,34 +310,15 @@ static int find_identical_files(struct hash_table *srcs,
 	return renames;
 }
 
-static int free_similarity_list(void *p, void *unused)
+static void insert_file_table(hashmap *table, int index, struct diff_filespec *filespec)
 {
-	while (p) {
-		struct file_similarity *entry = p;
-		p = entry->next;
-		free(entry);
-	}
-	return 0;
-}
-
-static void insert_file_table(struct hash_table *table, int index, struct diff_filespec *filespec)
-{
-	void **pos;
-	unsigned int hash;
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
 	entry->index = index;
 	entry->filespec = filespec;
-	entry->next = NULL;
-
-	hash = hash_filespec(filespec);
-	pos = insert_hash(hash, entry, table);
 
-	/* We already had an entry there? */
-	if (pos) {
-		entry->next = *pos;
-		*pos = entry;
-	}
+	hashmap_entry_init(entry, hash_filespec(filespec), 0);
+	hashmap_add(table, entry);
 }
 
 /*
@@ -349,11 +331,10 @@ static void insert_file_table(struct hash_table *table, int index, struct diff_f
 static int find_exact_renames(struct diff_options *options)
 {
 	int i, renames;
-	struct hash_table file_table;
+	hashmap file_table;
 
 	/* Add all sources to the hash table */
-	init_hash(&file_table);
-	preallocate_hash(&file_table, rename_src_nr);
+	hashmap_init(&file_table, NULL, rename_src_nr);
 	for (i = 0; i < rename_src_nr; i++)
 		insert_file_table(&file_table, i, rename_src[i].p->one);
 
@@ -361,11 +342,8 @@ static int find_exact_renames(struct diff_options *options)
 	for (i = 0; i < rename_dst_nr; i++)
 		renames += find_identical_files(&file_table, i, options);
 
-	/* Free source file_similarity chains */
-	for_each_hash(&file_table, free_similarity_list, options);
-
-	/* .. and free the hash data structure */
-	free_hash(&file_table);
+	/* Free the hash data structure and entries */
+	hashmap_free(&file_table, free);
 
 	return renames;
 }
-- 
1.8.4.8243.gbcbdefd

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal
  2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
@ 2013-09-11 23:56   ` Junio C Hamano
  2013-09-23  9:16     ` Karsten Blees
  2013-09-12  4:10   ` Junio C Hamano
  1 sibling, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2013-09-11 23:56 UTC (permalink / raw
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> +#define FNV32_BASE ((unsigned int) 0x811c9dc5)
> +#define FNV32_PRIME ((unsigned int) 0x01000193)
> + ...
> +static inline unsigned int bucket(const hashmap *map, const hashmap_entry *key)
> +{
> +	return key->hash & (map->tablesize - 1);
> +}

As tablesize would hopefully be reasonably small, not worrying about
platforms' "unsigned int" being 64-bit (in which case it would be
more appropriate to compute with FNV64_PRIME) should be fine.

> +static inline hashmap_entry **find_entry(const hashmap *map,
> +		const hashmap_entry *key)
> +{
> +	hashmap_entry **e = &map->table[bucket(map, key)];
> +	while (*e && !entry_equals(map, *e, key))
> +		e = &(*e)->next;
> +	return e;
> +}

(mental note) This finds the location the pointer to the entry is
stored, not the entry itself.

> +void *hashmap_get(const hashmap *map, const void *key)
> +{
> +	return *find_entry(map, key);
> +}

... which is consistent with this, and more importantly, it is
crucial for hashmap_remove()'s implementation, because...

> +void *hashmap_remove(hashmap *map, const void *key)
> +{
> +	hashmap_entry *old;
> +	hashmap_entry **e = find_entry(map, key);
> +	if (!*e)
> +		return NULL;
> +
> +	/* remove existing entry */
> +	old = *e;
> +	*e = old->next;

... this wants to update the linked list in place.

Looking good.

I however wonder if the singly linked linear chain is a really good
alternative for the access pattern of the hashes we use, though.  Do
we really want to trigger growth on the bucket load factor, not the
length of the longest chain, for example?

> +	old->next = NULL;
> +
> +	/* fix size and rehash if appropriate */
> +	map->size--;
> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
> +		map->size * HASHMAP_SHRINK_AT < map->tablesize)
> +		rehash(map, map->tablesize >> HASHMAP_GROW);

Please align the first two lines so that the first non-whitespace on
the second line of the condition part of the "if" statement
(i.e. 'm') aligns with the first non-whitespace inside the '(' open
parenthesis (i.e. 'm').

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal
  2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
  2013-09-11 23:56   ` Junio C Hamano
@ 2013-09-12  4:10   ` Junio C Hamano
  2013-09-23  9:21     ` Karsten Blees
  1 sibling, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2013-09-12  4:10 UTC (permalink / raw
  To: Karsten Blees; +Cc: Git List

Karsten Blees <karsten.blees@gmail.com> writes:

> +/*
> + * Hashmap entry data structure, intended to be used as first member of user
> + * data structures. Consists of a pointer and an int. Ideally it should be

It is technically correct to say this is "intended to be" used, but
to those who are using this API, it would be more helpful to say "a
user data structure that uses this API *must* have this as its first
member field".

> + * followed by an int-sized member to prevent unused memory on 64-bit systems
> + * due to alignment.
> + */
> +typedef struct hashmap_entry {
> +	struct hashmap_entry *next;
> +	unsigned int hash;
> +} hashmap_entry;
> + ...
> +typedef struct hashmap {
> +	hashmap_entry **table;
> +	hashmap_cmp_fn cmpfn;
> +	unsigned int size, tablesize;
> +} hashmap;

I forgot to mention in my previous message, but we find that the
code tends to be easier to read if we avoid using typedef'ed struct
like these.  E.g. in 2/5 we see something like this:

     static int abbrev = -1; /* unspecified */
     static int max_candidates = 10;
    -static struct hash_table names;
    +static hashmap names;
     static int have_util;
     static const char *pattern;
     static int always;
    @@ -38,7 +38,7 @@ static const char *diff_index_args[] = {


     struct commit_name {
    -	struct commit_name *next;
    +	hashmap_entry entry;
            unsigned char peeled[20];

The version before the patch is preferrable.

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal
  2013-09-11 23:56   ` Junio C Hamano
@ 2013-09-23  9:16     ` Karsten Blees
  0 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-23  9:16 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Git List

Sorry for the delay, I've been on vacation...

Am 12.09.2013 01:56, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> +#define FNV32_BASE ((unsigned int) 0x811c9dc5)
>> +#define FNV32_PRIME ((unsigned int) 0x01000193)
>> + ...
>> +static inline unsigned int bucket(const hashmap *map, const hashmap_entry *key)
>> +{
>> +	return key->hash & (map->tablesize - 1);
>> +}
> 
> As tablesize would hopefully be reasonably small, not worrying about
> platforms' "unsigned int" being 64-bit (in which case it would be
> more appropriate to compute with FNV64_PRIME) should be fine.
> 
>> +static inline hashmap_entry **find_entry(const hashmap *map,
>> +		const hashmap_entry *key)
>> +{
>> +	hashmap_entry **e = &map->table[bucket(map, key)];
>> +	while (*e && !entry_equals(map, *e, key))
>> +		e = &(*e)->next;
>> +	return e;
>> +}
> 
> (mental note) This finds the location the pointer to the entry is
> stored, not the entry itself.
> 

Will rename to find_entry_ptr to make this clear

>> +void *hashmap_get(const hashmap *map, const void *key)
>> +{
>> +	return *find_entry(map, key);
>> +}
> 
> ... which is consistent with this, and more importantly, it is
> crucial for hashmap_remove()'s implementation, because...
> 
>> +void *hashmap_remove(hashmap *map, const void *key)
>> +{
>> +	hashmap_entry *old;
>> +	hashmap_entry **e = find_entry(map, key);
>> +	if (!*e)
>> +		return NULL;
>> +
>> +	/* remove existing entry */
>> +	old = *e;
>> +	*e = old->next;
> 
> ... this wants to update the linked list in place.
> 
> Looking good.
> 
> I however wonder if the singly linked linear chain is a really good
> alternative for the access pattern of the hashes we use, though.

I don't think it will make a big difference in lookup performance, especially with good hash codes (such as the first bytes of a sha1). In theory, chaining should be slightly faster, because entries are strictly confined to their buckets (i.e. no extraneous entries need to be traversed when looking up an entry). With uniform hash distribution, chaining should require (1 + loadfactor) comparisons to find an entry, while open addressing requires (1/(1 - loadfactor)) [1]. I'll add a performance test for lookups, though.

[1] http://en.wikipedia.org/wiki/Hash_table#Performance_analysis

> Do we really want to trigger growth on the bucket load factor, not the
> length of the longest chain, for example?
> 

With good hashes and a load factor < 1, the longest 'chain' is 1. We only get chains in case of collisions, which however cannot necessarily be resolved by resizing. E.g. in the worst case, all entries have the same hash code, which deforms the hash table into a long linked list in a single bucket. Resizing won't change that.

An alternative would be to resize on the number of used buckets instead of total entries. I.e. with exceptionally bad hash codes and lots of collisions, we at least don't waste memory by making the table unnecessarily large. However, I don't think this is worth the extra effort.

>> +	old->next = NULL;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size--;
>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> +		map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
> 
> Please align the first two lines so that the first non-whitespace on
> the second line of the condition part of the "if" statement
> (i.e. 'm') aligns with the first non-whitespace inside the '(' open
> parenthesis (i.e. 'm').
> 

Will do.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal
  2013-09-12  4:10   ` Junio C Hamano
@ 2013-09-23  9:21     ` Karsten Blees
  0 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-23  9:21 UTC (permalink / raw
  To: Junio C Hamano; +Cc: Git List

Am 12.09.2013 06:10, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> +/*
>> + * Hashmap entry data structure, intended to be used as first member of user
>> + * data structures. Consists of a pointer and an int. Ideally it should be
> 
> It is technically correct to say this is "intended to be" used, but
> to those who are using this API, it would be more helpful to say "a
> user data structure that uses this API *must* have this as its first
> member field".
> 

Right. I considered making the position in the user struct configurable via some offsetof() magic, but this would have just complicated things unnecessarily.

>> + * followed by an int-sized member to prevent unused memory on 64-bit systems
>> + * due to alignment.
>> + */
>> +typedef struct hashmap_entry {
>> +	struct hashmap_entry *next;
>> +	unsigned int hash;
>> +} hashmap_entry;
>> + ...
>> +typedef struct hashmap {
>> +	hashmap_entry **table;
>> +	hashmap_cmp_fn cmpfn;
>> +	unsigned int size, tablesize;
>> +} hashmap;
> 
> I forgot to mention in my previous message, but we find that the
> code tends to be easier to read if we avoid using typedef'ed struct
> like these.  E.g. in 2/5 we see something like this:
> 
>      static int abbrev = -1; /* unspecified */
>      static int max_candidates = 10;
>     -static struct hash_table names;
>     +static hashmap names;
>      static int have_util;
>      static const char *pattern;
>      static int always;
>     @@ -38,7 +38,7 @@ static const char *diff_index_args[] = {
> 
> 
>      struct commit_name {
>     -	struct commit_name *next;
>     +	hashmap_entry entry;
>             unsigned char peeled[20];
> 
> The version before the patch is preferrable.
> 

OK

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 0/5] New hash table implementation
  2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
                   ` (4 preceding siblings ...)
  2013-09-10 23:30 ` [PATCH/RFC 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
@ 2013-09-24  9:50 ` Karsten Blees
  2013-09-24  9:51   ` [PATCH v2 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
                     ` (7 more replies)
  5 siblings, 8 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-24  9:50 UTC (permalink / raw
  To: Git List

Also here: https://github.com/kblees/git/tree/kb/hashmap-v2

Changes since initial version:
- removed struct typedefs
- clarified documentation of hashmap_entry
- renamed find_entry -> find_entry_ptr
- added performance tests for lookup


I've also tried to resize the table based on the number of used buckets (instead of total entries). However, this doesn't work with hash functions that produce 'gaps'. E.g. a hash function that returns only even numbers would fill only every second bucket (i.e. max. 50% buckets used), and the 80% resize threshold is never reached.


Regarding performance, I have to admit that the difference between the two implementations is far greater than I had anticipated. The following times (in seconds) are from Linux x64 (Debian Sarge) on a Core i7 860 @2.8GHz. All tests have been run with 1,000 rounds of 100,000 entries each.

The 'get 10% hits' test does 100,000 lookups on a table with 10,000 entries (i.e. 90% unsuccessful lookups).

The rows denote different hash functions with different qualities:
- FNV: FNV-1 hash on stringified loop counter (i.e. fnv1(itoa(i))), as
  an example of a high quality / low collision hash
- i: just the loop counter (i.e. 0, 1, 2,...), guaranteed collision free
- i/10: every 10 entries share the same hash code, lots of collisions

The i and i/10 tests show that open addressing suffers badly from clustering, i.e. with adjacent hash codes, it degrades to linear search. The *2 versions provide for some space between used buckets to better compare it to the chaining version.


        |       add        |  get 100% hits  |    get 10% hits
        |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
--------+--------+---------+-------+---------+---------+--------
FNV     | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
i       |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
i    x2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
i/10    |  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206

Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.


Karsten Blees (5):
  add a hashtable implementation that supports O(1) removal
  buitin/describe.c: use new hash map implementation
  diffcore-rename.c: move code around to prepare for the next patch
  diffcore-rename.c: simplify finding exact renames
  diffcore-rename.c: use new hash map implementation

 Makefile           |   3 +
 builtin/describe.c |  53 ++++----
 diffcore-rename.c  | 185 ++++++++++------------------
 hashmap.c          | 211 +++++++++++++++++++++++++++++++
 hashmap.h          | 200 ++++++++++++++++++++++++++++++
 t/t0011-hashmap.sh | 236 +++++++++++++++++++++++++++++++++++
 test-hashmap.c     | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 1092 insertions(+), 150 deletions(-)
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c


---
git diff kb/hashmap..kb/hashmap-v2:

diff --git a/builtin/describe.c b/builtin/describe.c
index bbc7159..5db5d89 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -25,7 +25,7 @@ static int longformat;
 static int first_parent;
 static int abbrev = -1; /* unspecified */
 static int max_candidates = 10;
-static hashmap names;
+static struct hashmap names;
 static int have_util;
 static const char *pattern;
 static int always;
@@ -38,7 +38,7 @@ static const char *diff_index_args[] = {
 
 
 struct commit_name {
-	hashmap_entry entry;
+	struct hashmap_entry entry;
 	unsigned char peeled[20];
 	struct tag *tag;
 	unsigned prio:2; /* annotated tag = 2, tag = 1, head = 0 */
@@ -280,7 +280,7 @@ static void describe(const char *arg, int last_one)
 		fprintf(stderr, _("searching to describe %s\n"), arg);
 
 	if (!have_util) {
-		hashmap_iter iter;
+		struct hashmap_iter iter;
 		struct commit *c;
 		struct commit_name *n = hashmap_iter_first(&names, &iter);
 		for (; n; n = hashmap_iter_next(&iter)) {
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6271af9..2e70d31 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -243,7 +243,7 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
-	hashmap_entry entry;
+	struct hashmap_entry entry;
 	int index;
 	struct diff_filespec *filespec;
 };
@@ -260,7 +260,7 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(hashmap *srcs,
+static int find_identical_files(struct hashmap *srcs,
 				int dst_index,
 				struct diff_options *options)
 {
@@ -310,7 +310,7 @@ static int find_identical_files(hashmap *srcs,
 	return renames;
 }
 
-static void insert_file_table(hashmap *table, int index, struct diff_filespec *filespec)
+static void insert_file_table(struct hashmap *table, int index, struct diff_filespec *filespec)
 {
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
@@ -331,7 +331,7 @@ static void insert_file_table(hashmap *table, int index, struct diff_filespec *f
 static int find_exact_renames(struct diff_options *options)
 {
 	int i, renames;
-	hashmap file_table;
+	struct hashmap file_table;
 
 	/* Add all sources to the hash table */
 	hashmap_init(&file_table, NULL, rename_src_nr);
diff --git a/hashmap.c b/hashmap.c
index 686ee6f..75a8578 100644
--- a/hashmap.c
+++ b/hashmap.c
@@ -58,28 +58,29 @@ unsigned int memihash(const void *buf, size_t len)
 /* shrink if < 16.6% full (to 66.6%) */
 #define HASHMAP_SHRINK_AT 6
 
-static inline int entry_equals(const hashmap *map, const hashmap_entry *e1,
-		const hashmap_entry *e2)
+static inline int entry_equals(const struct hashmap *map,
+		const struct hashmap_entry *e1, const struct hashmap_entry *e2)
 {
 	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2));
 }
 
-static inline unsigned int bucket(const hashmap *map, const hashmap_entry *key)
+static inline unsigned int bucket(const struct hashmap *map,
+		const struct hashmap_entry *key)
 {
 	return key->hash & (map->tablesize - 1);
 }
 
-static void rehash(hashmap *map, unsigned int newsize)
+static void rehash(struct hashmap *map, unsigned int newsize)
 {
 	unsigned int i, oldsize = map->tablesize;
-	hashmap_entry **oldtable = map->table;
+	struct hashmap_entry **oldtable = map->table;
 
 	map->tablesize = newsize;
-	map->table = xcalloc(sizeof(hashmap_entry*), map->tablesize);
+	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
 	for (i = 0; i < oldsize; i++) {
-		hashmap_entry *e = oldtable[i];
+		struct hashmap_entry *e = oldtable[i];
 		while (e) {
-			hashmap_entry *next = e->next;
+			struct hashmap_entry *next = e->next;
 			unsigned int b = bucket(map, e);
 			e->next = map->table[b];
 			map->table[b] = e;
@@ -89,10 +90,10 @@ static void rehash(hashmap *map, unsigned int newsize)
 	free(oldtable);
 }
 
-static inline hashmap_entry **find_entry(const hashmap *map,
-		const hashmap_entry *key)
+static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
+		const struct hashmap_entry *key)
 {
-	hashmap_entry **e = &map->table[bucket(map, key)];
+	struct hashmap_entry **e = &map->table[bucket(map, key)];
 	while (*e && !entry_equals(map, *e, key))
 		e = &(*e)->next;
 	return e;
@@ -103,7 +104,7 @@ static int always_equal(const void *unused1, const void *unused2)
 	return 0;
 }
 
-void hashmap_init(hashmap *map, hashmap_cmp_fn equals_function,
+void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
 		size_t initial_size)
 {
 	map->size = 0;
@@ -113,16 +114,16 @@ void hashmap_init(hashmap *map, hashmap_cmp_fn equals_function,
 	initial_size *= HASHMAP_GROW_AT;
 	while (initial_size > map->tablesize)
 		map->tablesize <<= HASHMAP_GROW;
-	map->table = xcalloc(sizeof(hashmap_entry*), map->tablesize);
+	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
 }
 
-void hashmap_free(hashmap *map, hashmap_free_fn free_function)
+void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
 {
 	if (!map || !map->table)
 		return;
 	if (free_function) {
-		hashmap_iter iter;
-		hashmap_entry *e;
+		struct hashmap_iter iter;
+		struct hashmap_entry *e;
 		hashmap_iter_init(map, &iter);
 		while ((e = hashmap_iter_next(&iter)))
 			(*free_function)(e);
@@ -131,26 +132,26 @@ void hashmap_free(hashmap *map, hashmap_free_fn free_function)
 	memset(map, 0, sizeof(*map));
 }
 
-void *hashmap_get(const hashmap *map, const void *key)
+void *hashmap_get(const struct hashmap *map, const void *key)
 {
-	return *find_entry(map, key);
+	return *find_entry_ptr(map, key);
 }
 
-void *hashmap_get_next(const hashmap *map, const void *entry)
+void *hashmap_get_next(const struct hashmap *map, const void *entry)
 {
-	hashmap_entry *e = ((hashmap_entry*) entry)->next;
+	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
 	for (; e; e = e->next)
 		if (entry_equals(map, entry, e))
 			return e;
 	return NULL;
 }
 
-void hashmap_add(hashmap *map, void *entry)
+void hashmap_add(struct hashmap *map, void *entry)
 {
 	unsigned int b = bucket(map, entry);
 
 	/* add entry */
-	((hashmap_entry*) entry)->next = map->table[b];
+	((struct hashmap_entry*) entry)->next = map->table[b];
 	map->table[b] = entry;
 
 	/* fix size and rehash if appropriate */
@@ -159,10 +160,10 @@ void hashmap_add(hashmap *map, void *entry)
 		rehash(map, map->tablesize << HASHMAP_GROW);
 }
 
-void *hashmap_remove(hashmap *map, const void *key)
+void *hashmap_remove(struct hashmap *map, const void *key)
 {
-	hashmap_entry *old;
-	hashmap_entry **e = find_entry(map, key);
+	struct hashmap_entry *old;
+	struct hashmap_entry **e = find_entry_ptr(map, key);
 	if (!*e)
 		return NULL;
 
@@ -174,28 +175,28 @@ void *hashmap_remove(hashmap *map, const void *key)
 	/* fix size and rehash if appropriate */
 	map->size--;
 	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
-		map->size * HASHMAP_SHRINK_AT < map->tablesize)
+	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
 		rehash(map, map->tablesize >> HASHMAP_GROW);
 	return old;
 }
 
-void *hashmap_put(hashmap *map, void *entry)
+void *hashmap_put(struct hashmap *map, void *entry)
 {
-	hashmap_entry *old = hashmap_remove(map, entry);
+	struct hashmap_entry *old = hashmap_remove(map, entry);
 	hashmap_add(map, entry);
 	return old;
 }
 
-void hashmap_iter_init(hashmap *map, hashmap_iter *iter)
+void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter)
 {
 	iter->map = map;
 	iter->tablepos = 0;
 	iter->next = NULL;
 }
 
-void *hashmap_iter_next(hashmap_iter *iter)
+void *hashmap_iter_next(struct hashmap_iter *iter)
 {
-	hashmap_entry *current = iter->next;
+	struct hashmap_entry *current = iter->next;
 	for (;;) {
 		if (current) {
 			iter->next = current->next;
diff --git a/hashmap.h b/hashmap.h
index 59f8489..98c4ebc 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -40,15 +40,15 @@ extern unsigned int memhash(const void *buf, size_t len);
 extern unsigned int memihash(const void *buf, size_t len);
 
 /*
- * Hashmap entry data structure, intended to be used as first member of user
- * data structures. Consists of a pointer and an int. Ideally it should be
- * followed by an int-sized member to prevent unused memory on 64-bit systems
- * due to alignment.
+ * Hashmap entry data structure, must be used as first member of user data
+ * structures. Consists of a pointer and an int. Ideally it should be followed
+ * by an int-sized member to prevent unused memory on 64-bit systems due to
+ * alignment.
  */
-typedef struct hashmap_entry {
+struct hashmap_entry {
 	struct hashmap_entry *next;
 	unsigned int hash;
-} hashmap_entry;
+};
 
 /*
  * User-supplied function to test two hashmap entries for equality, shall
@@ -67,20 +67,20 @@ typedef void (*hashmap_free_fn)(void *entry);
 /*
  * Hashmap data structure, use with hashmap_* functions.
  */
-typedef struct hashmap {
-	hashmap_entry **table;
+struct hashmap {
+	struct hashmap_entry **table;
 	hashmap_cmp_fn cmpfn;
 	unsigned int size, tablesize;
-} hashmap;
+};
 
 /*
  * Hashmap iterator data structure, use with hasmap_iter_* functions.
  */
-typedef struct hashmap_iter {
-	hashmap *map;
-	hashmap_entry *next;
+struct hashmap_iter {
+	struct hashmap *map;
+	struct hashmap_entry *next;
 	unsigned int tablepos;
-} hashmap_iter;
+};
 
 /*
  * Initializes a hashmap_entry structure.
@@ -90,9 +90,9 @@ typedef struct hashmap_iter {
  */
 static inline void hashmap_entry_init(void *entry, int hash, int key_only)
 {
-	hashmap_entry *e = entry;
+	struct hashmap_entry *e = entry;
 	e->hash = hash;
-	e->next = key_only ? (hashmap_entry*) -1 : NULL;
+	e->next = key_only ? (struct hashmap_entry*) -1 : NULL;
 }
 
 /*
@@ -108,8 +108,8 @@ static inline void hashmap_entry_init(void *entry, int hash, int key_only)
  */
 static inline int hashmap_entry_is_key(const void *entry)
 {
-	const hashmap_entry *e = entry;
-	return e->next == (hashmap_entry*) -1;
+	const struct hashmap_entry *e = entry;
+	return e->next == (struct hashmap_entry*) -1;
 }
 
 /*
@@ -119,7 +119,7 @@ static inline int hashmap_entry_is_key(const void *entry)
  *  NULL, entries are considered equal if their hash codes are equal.
  * initial_size: optional number of initial entries, 0 if unknown
  */
-extern void hashmap_init(hashmap *map, hashmap_cmp_fn equals_function,
+extern void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
 		size_t initial_size);
 
 /*
@@ -127,7 +127,7 @@ extern void hashmap_init(hashmap *map, hashmap_cmp_fn equals_function,
  * map: hashmap to free
  * free_function: optional function to free the hashmap entries
  */
-extern void hashmap_free(hashmap *map, hashmap_free_fn free_function);
+extern void hashmap_free(struct hashmap *map, hashmap_free_fn free_function);
 
 /*
  * Returns the hashmap entry for the specified key, or NULL if not found.
@@ -135,7 +135,7 @@ extern void hashmap_free(hashmap *map, hashmap_free_fn free_function);
  * key: key of the entry to look up
  * returns matching hashmap entry, or NULL if not found
  */
-extern void *hashmap_get(const hashmap *map, const void *key);
+extern void *hashmap_get(const struct hashmap *map, const void *key);
 
 /*
  * Returns the next equal hashmap entry if the map contains duplicates (see
@@ -144,7 +144,7 @@ extern void *hashmap_get(const hashmap *map, const void *key);
  * entry: current entry, obtained via hashmap_get or hashmap_get_next
  * returns next equal hashmap entry, or NULL if not found
  */
-extern void *hashmap_get_next(const hashmap *map, const void *entry);
+extern void *hashmap_get_next(const struct hashmap *map, const void *entry);
 
 /*
  * Adds a hashmap entry. This allows to add duplicate entries (i.e. separate
@@ -152,7 +152,7 @@ extern void *hashmap_get_next(const hashmap *map, const void *entry);
  * map: the hashmap
  * entry: the entry to add
  */
-extern void hashmap_add(hashmap *map, void *entry);
+extern void hashmap_add(struct hashmap *map, void *entry);
 
 /*
  * Adds or replaces a hashmap entry.
@@ -160,7 +160,7 @@ extern void hashmap_add(hashmap *map, void *entry);
  * entry: the entry to add or replace
  * returns previous entry, or NULL if the entry is new
  */
-extern void *hashmap_put(hashmap *map, void *entry);
+extern void *hashmap_put(struct hashmap *map, void *entry);
 
 /*
  * Removes a hashmap entry matching the specified key.
@@ -168,21 +168,21 @@ extern void *hashmap_put(hashmap *map, void *entry);
  * key: key of the entry to remove
  * returns removed entry, or NULL if not found
  */
-extern void *hashmap_remove(hashmap *map, const void *key);
+extern void *hashmap_remove(struct hashmap *map, const void *key);
 
 /*
  * Initializes a hashmap iterator structure.
  * map: the hashmap
  * iter: hashmap iterator structure
  */
-extern void hashmap_iter_init(hashmap *map, hashmap_iter *iter);
+extern void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter);
 
 /**
  * Returns the next hashmap entry.
  * iter: hashmap iterator
  * returns next entry, or NULL if there are no more entries
  */
-extern void *hashmap_iter_next(hashmap_iter *iter);
+extern void *hashmap_iter_next(struct hashmap_iter *iter);
 
 /**
  * Initializes a hashmap iterator and returns the first hashmap entry.
@@ -190,8 +190,8 @@ extern void *hashmap_iter_next(hashmap_iter *iter);
  * iter: hashmap iterator
  * returns first entry, or NULL if there are no entries
  */
-static inline void *hashmap_iter_first(hashmap *map,
-		hashmap_iter *iter)
+static inline void *hashmap_iter_first(struct hashmap *map,
+		struct hashmap_iter *iter)
 {
 	hashmap_iter_init(map, iter);
 	return hashmap_iter_next(iter);
diff --git a/test-hashmap.c b/test-hashmap.c
index a4b3440..de94c6d 100644
--- a/test-hashmap.c
+++ b/test-hashmap.c
@@ -2,113 +2,197 @@
 #include "hashmap.h"
 #include <stdio.h>
 
-typedef struct test_entry
+struct test_entry
 {
-	hashmap_entry ent;
+	struct hashmap_entry ent;
 	/* key and value as two \0-terminated strings */
 	char key[FLEX_ARRAY];
-} test_entry;
+};
 
-typedef struct test_key
+struct test_key
 {
-	hashmap_entry ent;
+	struct hashmap_entry ent;
 	char *key;
-} test_key;
+};
 
-static const char *get_key(const test_entry *e)
+static const char *get_key(const struct test_entry *e)
 {
-	return hashmap_entry_is_key(e) ? ((test_key*) e)->key : e->key;
+	return hashmap_entry_is_key(e) ? ((struct test_key*) e)->key : e->key;
 }
 
-static const char *get_value(const test_entry *e)
+static const char *get_value(const struct test_entry *e)
 {
 	return e->key + strlen(e->key) + 1;
 }
 
-static int test_entry_cmp(const test_entry *e1, const test_entry *e2)
+static int test_entry_cmp(const struct test_entry *e1,
+		const struct test_entry *e2)
 {
 	return strcmp(e1->key, get_key(e2));
 }
 
-static int test_entry_cmp_icase(const test_entry *e1, const test_entry *e2)
+static int test_entry_cmp_icase(const struct test_entry *e1,
+		const struct test_entry *e2)
 {
 	return strcasecmp(e1->key, get_key(e2));
 }
 
-static test_entry *alloc_test_entry(int hash, char *key, int klen, char *value,
-		int vlen)
+static struct test_entry *alloc_test_entry(int hash, char *key, int klen,
+		char *value, int vlen)
 {
-	test_entry *entry = malloc(sizeof(test_entry) + klen + vlen + 2);
+	struct test_entry *entry = malloc(sizeof(struct test_entry) + klen
+			+ vlen + 2);
 	hashmap_entry_init(entry, hash, 0);
 	memcpy(entry->key, key, klen + 1);
 	memcpy(entry->key + klen + 1, value, vlen + 1);
 	return entry;
 }
 
+#define HASH_METHOD_FNV 0
+#define HASH_METHOD_I 1
+#define HASH_METHOD_IDIV10 2
+#define HASH_METHOD_0 3
+#define HASH_METHOD_X2 4
+#define TEST_SPARSE 8
+#define TEST_ADD 16
+#define TEST_SIZE 100000
+
+static unsigned int hash(unsigned int method, unsigned int i, const char *key)
+{
+	unsigned int hash;
+	switch (method & 3)
+	{
+	case HASH_METHOD_FNV:
+		hash = strhash(key);
+		break;
+	case HASH_METHOD_I:
+		hash = i;
+		break;
+	case HASH_METHOD_IDIV10:
+		hash = i / 10;
+		break;
+	case HASH_METHOD_0:
+		hash = 0;
+		break;
+	}
+
+	if (method & HASH_METHOD_X2)
+		hash = 2 * hash;
+	return hash;
+}
+
 /*
  * Test insert performance of hashmap.[ch]
- * Usage: time echo "perfhashmap size rounds" | test-hashmap
+ * Usage: time echo "perfhashmap method rounds" | test-hashmap
  */
-static void perf_hashmap(unsigned int size, unsigned int rounds)
+static void perf_hashmap(unsigned int method, unsigned int rounds)
 {
-	hashmap map;
+	struct hashmap map;
 	char buf[16];
-	test_entry **entries;
+	struct test_entry **entries;
+	unsigned int *hashes;
 	unsigned int i, j;
 
-	entries = malloc(size * sizeof(test_entry*));
-	for (i = 0; i < size; i++) {
+	entries = malloc(TEST_SIZE * sizeof(struct test_entry*));
+	hashes = malloc(TEST_SIZE * sizeof(int));
+	for (i = 0; i < TEST_SIZE; i++) {
 		snprintf(buf, sizeof(buf), "%i", i);
 		entries[i] = alloc_test_entry(0, buf, strlen(buf), "", 0);
+		hashes[i] = hash(method, i, entries[i]->key);
 	}
 
-	for (j = 0; j < rounds; j++) {
-		// initialize the map
+	if (method & TEST_ADD) {
+		/* test adding to the map */
+		for (j = 0; j < rounds; j++) {
+			hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
+
+			/* add entries */
+			for (i = 0; i < TEST_SIZE; i++) {
+				hashmap_entry_init(entries[i], hashes[i], 0);
+				hashmap_add(&map, entries[i]);
+			}
+
+			hashmap_free(&map, NULL);
+		}
+	} else {
+		/* test map lookups */
 		hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
 
-		// add entries
-		for (i = 0; i < size; i++) {
-			unsigned int hash = strhash(entries[i]->key);
-			hashmap_entry_init(entries[i], hash, 0);
+		/* fill the map (sparsely if specified) */
+		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
+		for (i = 0; i < j; i++) {
+			hashmap_entry_init(entries[i], hashes[i], 0);
 			hashmap_add(&map, entries[i]);
 		}
 
+		for (j = 0; j < rounds; j++) {
+			for (i = 0; i < TEST_SIZE; i++) {
+				struct test_key k;
+				hashmap_entry_init(&k, hashes[i], 1);
+				k.key = entries[i]->key;
+				hashmap_get(&map, &k);
+			}
+		}
+
 		hashmap_free(&map, NULL);
 	}
 }
 
-typedef struct hash_entry
+struct hash_entry
 {
 	struct hash_entry *next;
 	char key[FLEX_ARRAY];
-} hash_entry;
+};
 
 /*
  * Test insert performance of hash.[ch]
- * Usage: time echo "perfhashtable size rounds" | test-hashmap
+ * Usage: time echo "perfhash method rounds" | test-hashmap
  */
-static void perf_hashtable(unsigned int size, unsigned int rounds)
+static void perf_hash(unsigned int method, unsigned int rounds)
 {
 	struct hash_table map;
 	char buf[16];
-	hash_entry **entries, **res;
+	struct hash_entry **entries, **res, *entry;
+	unsigned int *hashes;
 	unsigned int i, j;
 
-	entries = malloc(size * sizeof(hash_entry*));
-	for (i = 0; i < size; i++) {
+	entries = malloc(TEST_SIZE * sizeof(struct hash_entry*));
+	hashes = malloc(TEST_SIZE * sizeof(int));
+	for (i = 0; i < TEST_SIZE; i++) {
 		snprintf(buf, sizeof(buf), "%i", i);
-		entries[i] = malloc(sizeof(hash_entry) + strlen(buf) + 1);
+		entries[i] = malloc(sizeof(struct hash_entry) + strlen(buf) + 1);
 		strcpy(entries[i]->key, buf);
+		hashes[i] = hash(method, i, entries[i]->key);
 	}
 
-	for (j = 0; j < rounds; j++) {
-		// initialize the map
+	if (method & TEST_ADD) {
+		/* test adding to the map */
+		for (j = 0; j < rounds; j++) {
+			init_hash(&map);
+
+			/* add entries */
+			for (i = 0; i < TEST_SIZE; i++) {
+				res = (struct hash_entry**) insert_hash(
+						hashes[i], entries[i], &map);
+				if (res) {
+					entries[i]->next = *res;
+					*res = entries[i];
+				} else {
+					entries[i]->next = NULL;
+				}
+			}
+
+			free_hash(&map);
+		}
+	} else {
+		/* test map lookups */
 		init_hash(&map);
 
-		// add entries
-		for (i = 0; i < size; i++) {
-			unsigned int hash = strhash(entries[i]->key);
-			res = (hash_entry**) insert_hash(hash, entries[i], &map);
+		/* fill the map (sparsely if specified) */
+		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
+		for (i = 0; i < j; i++) {
+			res = (struct hash_entry**) insert_hash(hashes[i],
+					entries[i], &map);
 			if (res) {
 				entries[i]->next = *res;
 				*res = entries[i];
@@ -117,7 +201,19 @@ static void perf_hashtable(unsigned int size, unsigned int rounds)
 			}
 		}
 
+		for (j = 0; j < rounds; j++) {
+			for (i = 0; i < TEST_SIZE; i++) {
+				entry = lookup_hash(hashes[i], &map);
+				while (entry) {
+					if (!strcmp(entries[i]->key, entry->key))
+						break;
+					entry = entry->next;
+				}
+			}
+		}
+
 		free_hash(&map);
+
 	}
 }
 
@@ -133,13 +229,13 @@ static void perf_hashtable(unsigned int size, unsigned int rounds)
  * iterate -> key1 value1\nkey2 value2\n...
  * size -> tablesize numentries
  *
- * perfhashmap size rounds -> hashmap.[ch]: add <size> entries <rounds> times
- * perfhashtable size rounds -> hash.[ch]: add <size> entries <rounds> times
+ * perfhashmap method rounds -> test hashmap.[ch] performance
+ * perfhash method rounds -> test hash.[ch] performance
  */
 int main(int argc, char *argv[])
 {
 	char line[1024];
-	hashmap map;
+	struct hashmap map;
 	int icase;
 
 	/* init hash map */
@@ -151,7 +247,7 @@ int main(int argc, char *argv[])
 	while (fgets(line, sizeof(line), stdin)) {
 		char *cmd, *p1 = NULL, *p2 = NULL;
 		int l1 = 0, l2 = 0, hash = 0;
-		test_entry *entry;
+		struct test_entry *entry;
 
 		/* break line into command and up to two parameters */
 		cmd = strtok(line, DELIM);
@@ -197,7 +293,7 @@ int main(int argc, char *argv[])
 		} else if (!strcmp("get", cmd) && l1) {
 
 			/* setup static key */
-			test_key key;
+			struct test_key key;
 			hashmap_entry_init(&key, hash, 1);
 			key.key = p1;
 
@@ -215,7 +311,7 @@ int main(int argc, char *argv[])
 		} else if (!strcmp("remove", cmd) && l1) {
 
 			/* setup static key */
-			test_key key;
+			struct test_key key;
 			hashmap_entry_init(&key, hash, 1);
 			key.key = p1;
 
@@ -228,7 +324,7 @@ int main(int argc, char *argv[])
 
 		} else if (!strcmp("iterate", cmd)) {
 
-			hashmap_iter iter;
+			struct hashmap_iter iter;
 			hashmap_iter_init(&map, &iter);
 			while ((entry = hashmap_iter_next(&iter)))
 				printf("%s %s\n", get_key(entry), get_value(entry));
@@ -242,9 +338,9 @@ int main(int argc, char *argv[])
 
 			perf_hashmap(atoi(p1), atoi(p2));
 
-		} else if (!strcmp("perfhashtable", cmd) && l1 && l2) {
+		} else if (!strcmp("perfhash", cmd) && l1 && l2) {
 
-			perf_hashtable(atoi(p1), atoi(p2));
+			perf_hash(atoi(p1), atoi(p2));
 
 		} else {
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 1/5] add a hashtable implementation that supports O(1) removal
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
@ 2013-09-24  9:51   ` Karsten Blees
  2013-09-24  9:52   ` [PATCH v2 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-24  9:51 UTC (permalink / raw
  To: Git List

The existing hashtable implementation (in hash.[ch]) uses open addressing
(i.e. resolve hash collisions by distributing entries across the table).
Thus, removal is difficult to implement with less than O(n) complexity.
Resolving collisions of entries with identical hashes (e.g. via chaining)
is left to the client code.

Add a hashtable implementation that supports O(1) removal and is slightly
easier to use due to builtin entry chaining.

Supports all basic operations init, free, get, add, remove and iteration.

Also includes ready-to-use hash functions based on the public domain FNV-1
algorithm (http://www.isthe.com/chongo/tech/comp/fnv).

The per-entry data structure (hashmap_entry) is piggybacked in front of
the client's data structure to save memory. See test-hashmap.c for usage
examples.

The hashtable is resized by a factor of four when 80% full. With these
settings, average memory consumption is about 2/3 of hash.[ch], and
insertion is about twice as fast due to less frequent resizing.

Lookups are also slightly faster, because entries are strictly confined to
their bucket (i.e. no data of other buckets needs to be traversed).

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 Makefile           |   3 +
 hashmap.c          | 211 +++++++++++++++++++++++++++++++
 hashmap.h          | 200 ++++++++++++++++++++++++++++++
 t/t0011-hashmap.sh | 236 +++++++++++++++++++++++++++++++++++
 test-hashmap.c     | 354 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1004 insertions(+)
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c

diff --git a/Makefile b/Makefile
index 3588ca1..e6ad011 100644
--- a/Makefile
+++ b/Makefile
@@ -562,6 +562,7 @@ TEST_PROGRAMS_NEED_X += test-date
 TEST_PROGRAMS_NEED_X += test-delta
 TEST_PROGRAMS_NEED_X += test-dump-cache-tree
 TEST_PROGRAMS_NEED_X += test-genrandom
+TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
@@ -681,6 +682,7 @@ LIB_H += gpg-interface.h
 LIB_H += graph.h
 LIB_H += grep.h
 LIB_H += hash.h
+LIB_H += hashmap.h
 LIB_H += help.h
 LIB_H += http.h
 LIB_H += kwset.h
@@ -811,6 +813,7 @@ LIB_OBJS += gpg-interface.o
 LIB_OBJS += graph.o
 LIB_OBJS += grep.o
 LIB_OBJS += hash.o
+LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
diff --git a/hashmap.c b/hashmap.c
new file mode 100644
index 0000000..75a8578
--- /dev/null
+++ b/hashmap.c
@@ -0,0 +1,211 @@
+/*
+ * Generic implementation of hash-based key value mappings.
+ */
+#include "cache.h"
+#include "hashmap.h"
+
+#define FNV32_BASE ((unsigned int) 0x811c9dc5)
+#define FNV32_PRIME ((unsigned int) 0x01000193)
+
+unsigned int strhash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++))
+		hash = (hash * FNV32_PRIME) ^ c;
+	return hash;
+}
+
+unsigned int strihash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++)) {
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memhash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char*) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memihash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char*) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+#define HASHMAP_INITIAL_SIZE 64
+/* grow / shrink by 2^2 */
+#define HASHMAP_GROW 2
+/* grow if > 80% full (to 20%) */
+#define HASHMAP_GROW_AT 1.25
+/* shrink if < 16.6% full (to 66.6%) */
+#define HASHMAP_SHRINK_AT 6
+
+static inline int entry_equals(const struct hashmap *map,
+		const struct hashmap_entry *e1, const struct hashmap_entry *e2)
+{
+	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2));
+}
+
+static inline unsigned int bucket(const struct hashmap *map,
+		const struct hashmap_entry *key)
+{
+	return key->hash & (map->tablesize - 1);
+}
+
+static void rehash(struct hashmap *map, unsigned int newsize)
+{
+	unsigned int i, oldsize = map->tablesize;
+	struct hashmap_entry **oldtable = map->table;
+
+	map->tablesize = newsize;
+	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
+	for (i = 0; i < oldsize; i++) {
+		struct hashmap_entry *e = oldtable[i];
+		while (e) {
+			struct hashmap_entry *next = e->next;
+			unsigned int b = bucket(map, e);
+			e->next = map->table[b];
+			map->table[b] = e;
+			e = next;
+		}
+	}
+	free(oldtable);
+}
+
+static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
+		const struct hashmap_entry *key)
+{
+	struct hashmap_entry **e = &map->table[bucket(map, key)];
+	while (*e && !entry_equals(map, *e, key))
+		e = &(*e)->next;
+	return e;
+}
+
+static int always_equal(const void *unused1, const void *unused2)
+{
+	return 0;
+}
+
+void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
+		size_t initial_size)
+{
+	map->size = 0;
+	map->cmpfn = equals_function ? equals_function : always_equal;
+	/* calculate initial table size and allocate the table */
+	map->tablesize = HASHMAP_INITIAL_SIZE;
+	initial_size *= HASHMAP_GROW_AT;
+	while (initial_size > map->tablesize)
+		map->tablesize <<= HASHMAP_GROW;
+	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
+}
+
+void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
+{
+	if (!map || !map->table)
+		return;
+	if (free_function) {
+		struct hashmap_iter iter;
+		struct hashmap_entry *e;
+		hashmap_iter_init(map, &iter);
+		while ((e = hashmap_iter_next(&iter)))
+			(*free_function)(e);
+	}
+	free(map->table);
+	memset(map, 0, sizeof(*map));
+}
+
+void *hashmap_get(const struct hashmap *map, const void *key)
+{
+	return *find_entry_ptr(map, key);
+}
+
+void *hashmap_get_next(const struct hashmap *map, const void *entry)
+{
+	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
+	for (; e; e = e->next)
+		if (entry_equals(map, entry, e))
+			return e;
+	return NULL;
+}
+
+void hashmap_add(struct hashmap *map, void *entry)
+{
+	unsigned int b = bucket(map, entry);
+
+	/* add entry */
+	((struct hashmap_entry*) entry)->next = map->table[b];
+	map->table[b] = entry;
+
+	/* fix size and rehash if appropriate */
+	map->size++;
+	if (map->size * HASHMAP_GROW_AT > map->tablesize)
+		rehash(map, map->tablesize << HASHMAP_GROW);
+}
+
+void *hashmap_remove(struct hashmap *map, const void *key)
+{
+	struct hashmap_entry *old;
+	struct hashmap_entry **e = find_entry_ptr(map, key);
+	if (!*e)
+		return NULL;
+
+	/* remove existing entry */
+	old = *e;
+	*e = old->next;
+	old->next = NULL;
+
+	/* fix size and rehash if appropriate */
+	map->size--;
+	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
+	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
+		rehash(map, map->tablesize >> HASHMAP_GROW);
+	return old;
+}
+
+void *hashmap_put(struct hashmap *map, void *entry)
+{
+	struct hashmap_entry *old = hashmap_remove(map, entry);
+	hashmap_add(map, entry);
+	return old;
+}
+
+void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter)
+{
+	iter->map = map;
+	iter->tablepos = 0;
+	iter->next = NULL;
+}
+
+void *hashmap_iter_next(struct hashmap_iter *iter)
+{
+	struct hashmap_entry *current = iter->next;
+	for (;;) {
+		if (current) {
+			iter->next = current->next;
+			return current;
+		}
+
+		if (iter->tablepos >= iter->map->tablesize)
+			return NULL;
+
+		current = iter->map->table[iter->tablepos++];
+	}
+}
diff --git a/hashmap.h b/hashmap.h
new file mode 100644
index 0000000..98c4ebc
--- /dev/null
+++ b/hashmap.h
@@ -0,0 +1,200 @@
+#ifndef HASHMAP_H
+#define HASHMAP_H
+
+/*
+ * Generic implementation of hash-based key value mappings.
+ * Supports basic operations get, add/put, remove and iteration.
+ *
+ * Also contains a set of ready-to-use hash functions for strings, using the
+ * FNV-1 algorithm (see http://www.isthe.com/chongo/tech/comp/fnv).
+ */
+
+/*
+ * Case-sensitive FNV-1 hash of 0-terminated string.
+ * str: the string
+ * returns hash code
+ */
+extern unsigned int strhash(const char *buf);
+
+/*
+ * Case-insensitive FNV-1 hash of 0-terminated string.
+ * str: the string
+ * returns hash code
+ */
+extern unsigned int strihash(const char *buf);
+
+/*
+ * Case-sensitive FNV-1 hash of a memory block.
+ * buf: start of the memory block
+ * len: length of the memory block
+ * returns hash code
+ */
+extern unsigned int memhash(const void *buf, size_t len);
+
+/*
+ * Case-insensitive FNV-1 hash of a memory block.
+ * buf: start of the memory block
+ * len: length of the memory block
+ * returns hash code
+ */
+extern unsigned int memihash(const void *buf, size_t len);
+
+/*
+ * Hashmap entry data structure, must be used as first member of user data
+ * structures. Consists of a pointer and an int. Ideally it should be followed
+ * by an int-sized member to prevent unused memory on 64-bit systems due to
+ * alignment.
+ */
+struct hashmap_entry {
+	struct hashmap_entry *next;
+	unsigned int hash;
+};
+
+/*
+ * User-supplied function to test two hashmap entries for equality, shall
+ * return 0 if the entries are equal. This function is always called with
+ * non-NULL parameters that have the same hash code. When looking up an entry,
+ * the key parameter to hashmap_get and hashmap_remove is always passed as
+ * second argument.
+ */
+typedef int (*hashmap_cmp_fn)(const void *entry, const void *entry_or_key);
+
+/*
+ * User-supplied function to free a hashmap entry.
+ */
+typedef void (*hashmap_free_fn)(void *entry);
+
+/*
+ * Hashmap data structure, use with hashmap_* functions.
+ */
+struct hashmap {
+	struct hashmap_entry **table;
+	hashmap_cmp_fn cmpfn;
+	unsigned int size, tablesize;
+};
+
+/*
+ * Hashmap iterator data structure, use with hasmap_iter_* functions.
+ */
+struct hashmap_iter {
+	struct hashmap *map;
+	struct hashmap_entry *next;
+	unsigned int tablepos;
+};
+
+/*
+ * Initializes a hashmap_entry structure.
+ * entry: pointer to the entry to initialize
+ * hash: hash code of the entry
+ * key_only: true if entry is a key-only structure, see hashmap_entry_is_key
+ */
+static inline void hashmap_entry_init(void *entry, int hash, int key_only)
+{
+	struct hashmap_entry *e = entry;
+	e->hash = hash;
+	e->next = key_only ? (struct hashmap_entry*) -1 : NULL;
+}
+
+/*
+ * Checks if hashmap_entry was initialized with the key_only flag. This is
+ * useful if the entry structure is variable-sized (e.g. ending in a FLEX_ARRAY)
+ * and the key is part of the variable portion. To prevent dynamic allocation of
+ * a full-fledged entry structure for each lookup, a smaller, statically sized
+ * structure can be used as key (i.e. replacing the FLEX_ARRAY member with a
+ * char pointer). The hashmap_cmp_fn comparison function can then check whether
+ * entry_or_key is a full-fledged entry or a key-only structure.
+ * entry: pointer to the entry to check
+ * returns 0 for key-value entries and non-0 for key-only entries
+ */
+static inline int hashmap_entry_is_key(const void *entry)
+{
+	const struct hashmap_entry *e = entry;
+	return e->next == (struct hashmap_entry*) -1;
+}
+
+/*
+ * Initializes a hashmap structure.
+ * map: hashmap to initialize
+ * equals_function: optional function to test equality of hashmap entries. If
+ *  NULL, entries are considered equal if their hash codes are equal.
+ * initial_size: optional number of initial entries, 0 if unknown
+ */
+extern void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
+		size_t initial_size);
+
+/*
+ * Frees a hashmap structure and allocated memory.
+ * map: hashmap to free
+ * free_function: optional function to free the hashmap entries
+ */
+extern void hashmap_free(struct hashmap *map, hashmap_free_fn free_function);
+
+/*
+ * Returns the hashmap entry for the specified key, or NULL if not found.
+ * map: the hashmap
+ * key: key of the entry to look up
+ * returns matching hashmap entry, or NULL if not found
+ */
+extern void *hashmap_get(const struct hashmap *map, const void *key);
+
+/*
+ * Returns the next equal hashmap entry if the map contains duplicates (see
+ * hashmap_add).
+ * map: the hashmap
+ * entry: current entry, obtained via hashmap_get or hashmap_get_next
+ * returns next equal hashmap entry, or NULL if not found
+ */
+extern void *hashmap_get_next(const struct hashmap *map, const void *entry);
+
+/*
+ * Adds a hashmap entry. This allows to add duplicate entries (i.e. separate
+ * values with the same key according to hashmap_cmp_fn).
+ * map: the hashmap
+ * entry: the entry to add
+ */
+extern void hashmap_add(struct hashmap *map, void *entry);
+
+/*
+ * Adds or replaces a hashmap entry.
+ * map: the hashmap
+ * entry: the entry to add or replace
+ * returns previous entry, or NULL if the entry is new
+ */
+extern void *hashmap_put(struct hashmap *map, void *entry);
+
+/*
+ * Removes a hashmap entry matching the specified key.
+ * map: the hashmap
+ * key: key of the entry to remove
+ * returns removed entry, or NULL if not found
+ */
+extern void *hashmap_remove(struct hashmap *map, const void *key);
+
+/*
+ * Initializes a hashmap iterator structure.
+ * map: the hashmap
+ * iter: hashmap iterator structure
+ */
+extern void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter);
+
+/**
+ * Returns the next hashmap entry.
+ * iter: hashmap iterator
+ * returns next entry, or NULL if there are no more entries
+ */
+extern void *hashmap_iter_next(struct hashmap_iter *iter);
+
+/**
+ * Initializes a hashmap iterator and returns the first hashmap entry.
+ * map: the hashmap
+ * iter: hashmap iterator
+ * returns first entry, or NULL if there are no entries
+ */
+static inline void *hashmap_iter_first(struct hashmap *map,
+		struct hashmap_iter *iter)
+{
+	hashmap_iter_init(map, iter);
+	return hashmap_iter_next(iter);
+}
+
+#endif
diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
new file mode 100755
index 0000000..6c699d5
--- /dev/null
+++ b/t/t0011-hashmap.sh
@@ -0,0 +1,236 @@
+#!/bin/sh
+
+test_description='test hashmap and string hash functions'
+. ./test-lib.sh
+
+test_hashmap() {
+	echo "$1" | test-hashmap $3 > actual &&
+	echo "$2" > expect &&
+	test_cmp expect actual
+}
+
+test_expect_success 'hash functions' '
+
+test_hashmap "hash key1" "2215982743 2215982743 116372151 116372151" &&
+test_hashmap "hash key2" "2215982740 2215982740 116372148 116372148" &&
+test_hashmap "hash fooBarFrotz" "1383912807 1383912807 3189766727 3189766727" &&
+test_hashmap "hash foobarfrotz" "2862305959 2862305959 3189766727 3189766727"
+
+'
+
+test_expect_success 'put' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+size" "NULL
+NULL
+NULL
+NULL
+64 4"
+
+'
+
+test_expect_success 'put (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+size" "NULL
+NULL
+NULL
+64 3" ignorecase
+
+'
+
+test_expect_success 'replace' '
+
+test_hashmap "put key1 value1
+put key1 value2
+put fooBarFrotz value3
+put fooBarFrotz value4
+size" "NULL
+value1
+NULL
+value3
+64 2"
+
+'
+
+test_expect_success 'replace (case insensitive)' '
+
+test_hashmap "put key1 value1
+put Key1 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+size" "NULL
+value1
+NULL
+value3
+64 2" ignorecase
+
+'
+
+test_expect_success 'get' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+get key1
+get key2
+get fooBarFrotz
+get notInMap" "NULL
+NULL
+NULL
+NULL
+value1
+value2
+value3
+NULL"
+
+'
+
+test_expect_success 'get (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+get Key1
+get keY2
+get foobarfrotz
+get notInMap" "NULL
+NULL
+NULL
+value1
+value2
+value3
+NULL" ignorecase
+
+'
+
+test_expect_success 'add' '
+
+test_hashmap "add key1 value1
+add key1 value2
+add fooBarFrotz value3
+add fooBarFrotz value4
+get key1
+get fooBarFrotz
+get notInMap" "value2
+value1
+value4
+value3
+NULL"
+
+'
+
+test_expect_success 'add (case insensitive)' '
+
+test_hashmap "add key1 value1
+add Key1 value2
+add fooBarFrotz value3
+add foobarfrotz value4
+get key1
+get Foobarfrotz
+get notInMap" "value2
+value1
+value4
+value3
+NULL" ignorecase
+
+'
+
+test_expect_success 'remove' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+remove key1
+remove key2
+remove notInMap
+size" "NULL
+NULL
+NULL
+value1
+value2
+NULL
+64 1"
+
+'
+
+test_expect_success 'remove (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+remove Key1
+remove keY2
+remove notInMap
+size" "NULL
+NULL
+NULL
+value1
+value2
+NULL
+64 1" ignorecase
+
+'
+
+test_expect_success 'iterate' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+iterate" "NULL
+NULL
+NULL
+key2 value2
+key1 value1
+fooBarFrotz value3"
+
+'
+
+test_expect_success 'iterate (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+iterate" "NULL
+NULL
+NULL
+fooBarFrotz value3
+key2 value2
+key1 value1" ignorecase
+
+'
+
+test_expect_success 'grow / shrink' '
+
+	rm -f in &&
+	rm -f expect &&
+	for n in $(test_seq 51)
+	do
+		echo put key$n value$n >> in &&
+		echo NULL >> expect
+	done &&
+	echo size >> in &&
+	echo 64 51 >> expect &&
+	echo put key52 value52 >> in &&
+	echo NULL >> expect
+	echo size >> in &&
+	echo 256 52 >> expect &&
+	for n in $(test_seq 10)
+	do
+		echo remove key$n >> in &&
+		echo value$n >> expect
+	done &&
+	echo size >> in &&
+	echo 64 42 >> expect &&
+	cat in | test-hashmap > out &&
+	test_cmp expect out
+
+'
+
+test_done
diff --git a/test-hashmap.c b/test-hashmap.c
new file mode 100644
index 0000000..de94c6d
--- /dev/null
+++ b/test-hashmap.c
@@ -0,0 +1,354 @@
+#include "cache.h"
+#include "hashmap.h"
+#include <stdio.h>
+
+struct test_entry
+{
+	struct hashmap_entry ent;
+	/* key and value as two \0-terminated strings */
+	char key[FLEX_ARRAY];
+};
+
+struct test_key
+{
+	struct hashmap_entry ent;
+	char *key;
+};
+
+static const char *get_key(const struct test_entry *e)
+{
+	return hashmap_entry_is_key(e) ? ((struct test_key*) e)->key : e->key;
+}
+
+static const char *get_value(const struct test_entry *e)
+{
+	return e->key + strlen(e->key) + 1;
+}
+
+static int test_entry_cmp(const struct test_entry *e1,
+		const struct test_entry *e2)
+{
+	return strcmp(e1->key, get_key(e2));
+}
+
+static int test_entry_cmp_icase(const struct test_entry *e1,
+		const struct test_entry *e2)
+{
+	return strcasecmp(e1->key, get_key(e2));
+}
+
+static struct test_entry *alloc_test_entry(int hash, char *key, int klen,
+		char *value, int vlen)
+{
+	struct test_entry *entry = malloc(sizeof(struct test_entry) + klen
+			+ vlen + 2);
+	hashmap_entry_init(entry, hash, 0);
+	memcpy(entry->key, key, klen + 1);
+	memcpy(entry->key + klen + 1, value, vlen + 1);
+	return entry;
+}
+
+#define HASH_METHOD_FNV 0
+#define HASH_METHOD_I 1
+#define HASH_METHOD_IDIV10 2
+#define HASH_METHOD_0 3
+#define HASH_METHOD_X2 4
+#define TEST_SPARSE 8
+#define TEST_ADD 16
+#define TEST_SIZE 100000
+
+static unsigned int hash(unsigned int method, unsigned int i, const char *key)
+{
+	unsigned int hash;
+	switch (method & 3)
+	{
+	case HASH_METHOD_FNV:
+		hash = strhash(key);
+		break;
+	case HASH_METHOD_I:
+		hash = i;
+		break;
+	case HASH_METHOD_IDIV10:
+		hash = i / 10;
+		break;
+	case HASH_METHOD_0:
+		hash = 0;
+		break;
+	}
+
+	if (method & HASH_METHOD_X2)
+		hash = 2 * hash;
+	return hash;
+}
+
+/*
+ * Test insert performance of hashmap.[ch]
+ * Usage: time echo "perfhashmap method rounds" | test-hashmap
+ */
+static void perf_hashmap(unsigned int method, unsigned int rounds)
+{
+	struct hashmap map;
+	char buf[16];
+	struct test_entry **entries;
+	unsigned int *hashes;
+	unsigned int i, j;
+
+	entries = malloc(TEST_SIZE * sizeof(struct test_entry*));
+	hashes = malloc(TEST_SIZE * sizeof(int));
+	for (i = 0; i < TEST_SIZE; i++) {
+		snprintf(buf, sizeof(buf), "%i", i);
+		entries[i] = alloc_test_entry(0, buf, strlen(buf), "", 0);
+		hashes[i] = hash(method, i, entries[i]->key);
+	}
+
+	if (method & TEST_ADD) {
+		/* test adding to the map */
+		for (j = 0; j < rounds; j++) {
+			hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
+
+			/* add entries */
+			for (i = 0; i < TEST_SIZE; i++) {
+				hashmap_entry_init(entries[i], hashes[i], 0);
+				hashmap_add(&map, entries[i]);
+			}
+
+			hashmap_free(&map, NULL);
+		}
+	} else {
+		/* test map lookups */
+		hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
+
+		/* fill the map (sparsely if specified) */
+		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
+		for (i = 0; i < j; i++) {
+			hashmap_entry_init(entries[i], hashes[i], 0);
+			hashmap_add(&map, entries[i]);
+		}
+
+		for (j = 0; j < rounds; j++) {
+			for (i = 0; i < TEST_SIZE; i++) {
+				struct test_key k;
+				hashmap_entry_init(&k, hashes[i], 1);
+				k.key = entries[i]->key;
+				hashmap_get(&map, &k);
+			}
+		}
+
+		hashmap_free(&map, NULL);
+	}
+}
+
+struct hash_entry
+{
+	struct hash_entry *next;
+	char key[FLEX_ARRAY];
+};
+
+/*
+ * Test insert performance of hash.[ch]
+ * Usage: time echo "perfhash method rounds" | test-hashmap
+ */
+static void perf_hash(unsigned int method, unsigned int rounds)
+{
+	struct hash_table map;
+	char buf[16];
+	struct hash_entry **entries, **res, *entry;
+	unsigned int *hashes;
+	unsigned int i, j;
+
+	entries = malloc(TEST_SIZE * sizeof(struct hash_entry*));
+	hashes = malloc(TEST_SIZE * sizeof(int));
+	for (i = 0; i < TEST_SIZE; i++) {
+		snprintf(buf, sizeof(buf), "%i", i);
+		entries[i] = malloc(sizeof(struct hash_entry) + strlen(buf) + 1);
+		strcpy(entries[i]->key, buf);
+		hashes[i] = hash(method, i, entries[i]->key);
+	}
+
+	if (method & TEST_ADD) {
+		/* test adding to the map */
+		for (j = 0; j < rounds; j++) {
+			init_hash(&map);
+
+			/* add entries */
+			for (i = 0; i < TEST_SIZE; i++) {
+				res = (struct hash_entry**) insert_hash(
+						hashes[i], entries[i], &map);
+				if (res) {
+					entries[i]->next = *res;
+					*res = entries[i];
+				} else {
+					entries[i]->next = NULL;
+				}
+			}
+
+			free_hash(&map);
+		}
+	} else {
+		/* test map lookups */
+		init_hash(&map);
+
+		/* fill the map (sparsely if specified) */
+		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
+		for (i = 0; i < j; i++) {
+			res = (struct hash_entry**) insert_hash(hashes[i],
+					entries[i], &map);
+			if (res) {
+				entries[i]->next = *res;
+				*res = entries[i];
+			} else {
+				entries[i]->next = NULL;
+			}
+		}
+
+		for (j = 0; j < rounds; j++) {
+			for (i = 0; i < TEST_SIZE; i++) {
+				entry = lookup_hash(hashes[i], &map);
+				while (entry) {
+					if (!strcmp(entries[i]->key, entry->key))
+						break;
+					entry = entry->next;
+				}
+			}
+		}
+
+		free_hash(&map);
+
+	}
+}
+
+#define DELIM " \t\r\n"
+
+/*
+ * Read stdin line by line and print result of commands to stdout:
+ *
+ * hash key -> strhash(key) memhash(key) strihash(key) memihash(key)
+ * put key value -> NULL / old value
+ * get key -> NULL / value
+ * remove key -> NULL / old value
+ * iterate -> key1 value1\nkey2 value2\n...
+ * size -> tablesize numentries
+ *
+ * perfhashmap method rounds -> test hashmap.[ch] performance
+ * perfhash method rounds -> test hash.[ch] performance
+ */
+int main(int argc, char *argv[])
+{
+	char line[1024];
+	struct hashmap map;
+	int icase;
+
+	/* init hash map */
+	icase = argc > 1 && !strcmp("ignorecase", argv[1]);
+	hashmap_init(&map, (hashmap_cmp_fn) (icase ? test_entry_cmp_icase
+			: test_entry_cmp), 0);
+
+	/* process commands from stdin */
+	while (fgets(line, sizeof(line), stdin)) {
+		char *cmd, *p1 = NULL, *p2 = NULL;
+		int l1 = 0, l2 = 0, hash = 0;
+		struct test_entry *entry;
+
+		/* break line into command and up to two parameters */
+		cmd = strtok(line, DELIM);
+		/* ignore empty lines */
+		if (!cmd || *cmd == '#')
+			continue;
+
+		p1 = strtok(NULL, DELIM);
+		if (p1) {
+			l1 = strlen(p1);
+			hash = icase ? strihash(p1) : strhash(p1);
+			p2 = strtok(NULL, DELIM);
+			if (p2)
+				l2 = strlen(p2);
+		}
+
+		if (!strcmp("hash", cmd) && l1) {
+
+			/* print results of different hash functions */
+			printf("%u %u %u %u\n", strhash(p1), memhash(p1, l1),
+					strihash(p1), memihash(p1, l1));
+
+		} else if (!strcmp("add", cmd) && l1 && l2) {
+
+			/* create entry with key = p1, value = p2 */
+			entry = alloc_test_entry(hash, p1, l1, p2, l2);
+
+			/* add to hashmap */
+			hashmap_add(&map, entry);
+
+		} else if (!strcmp("put", cmd) && l1 && l2) {
+
+			/* create entry with key = p1, value = p2 */
+			entry = alloc_test_entry(hash, p1, l1, p2, l2);
+
+			/* add / replace entry */
+			entry = hashmap_put(&map, entry);
+
+			/* print and free replaced entry, if any */
+			puts(entry ? get_value(entry) : "NULL");
+			free(entry);
+
+		} else if (!strcmp("get", cmd) && l1) {
+
+			/* setup static key */
+			struct test_key key;
+			hashmap_entry_init(&key, hash, 1);
+			key.key = p1;
+
+			/* lookup entry in hashmap */
+			entry = hashmap_get(&map, &key);
+
+			/* print result */
+			if (!entry)
+				puts("NULL");
+			while (entry) {
+				puts(get_value(entry));
+				entry = hashmap_get_next(&map, entry);
+			}
+
+		} else if (!strcmp("remove", cmd) && l1) {
+
+			/* setup static key */
+			struct test_key key;
+			hashmap_entry_init(&key, hash, 1);
+			key.key = p1;
+
+			/* remove entry from hashmap */
+			entry = hashmap_remove(&map, &key);
+
+			/* print result and free entry*/
+			puts(entry ? get_value(entry) : "NULL");
+			free(entry);
+
+		} else if (!strcmp("iterate", cmd)) {
+
+			struct hashmap_iter iter;
+			hashmap_iter_init(&map, &iter);
+			while ((entry = hashmap_iter_next(&iter)))
+				printf("%s %s\n", get_key(entry), get_value(entry));
+
+		} else if (!strcmp("size", cmd)) {
+
+			/* print table sizes */
+			printf("%u %u\n", map.tablesize, map.size);
+
+		} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
+
+			perf_hashmap(atoi(p1), atoi(p2));
+
+		} else if (!strcmp("perfhash", cmd) && l1 && l2) {
+
+			perf_hash(atoi(p1), atoi(p2));
+
+		} else {
+
+			printf("Unknown command %s\n", cmd);
+
+		}
+	}
+
+	hashmap_free(&map, free);
+	return 0;
+}
-- 
1.8.4.5.gef01589.dirty

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 2/5] buitin/describe.c: use new hash map implementation
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
  2013-09-24  9:51   ` [PATCH v2 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
@ 2013-09-24  9:52   ` Karsten Blees
  2013-09-24  9:52   ` [PATCH v2 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-24  9:52 UTC (permalink / raw
  To: Git List

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 builtin/describe.c | 53 ++++++++++++++++++++++++-----------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 7d73722..5db5d89 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -6,7 +6,7 @@
 #include "exec_cmd.h"
 #include "parse-options.h"
 #include "diff.h"
-#include "hash.h"
+#include "hashmap.h"
 #include "argv-array.h"
 
 #define SEEN		(1u<<0)
@@ -25,7 +25,7 @@ static int longformat;
 static int first_parent;
 static int abbrev = -1; /* unspecified */
 static int max_candidates = 10;
-static struct hash_table names;
+static struct hashmap names;
 static int have_util;
 static const char *pattern;
 static int always;
@@ -38,7 +38,7 @@ static const char *diff_index_args[] = {
 
 
 struct commit_name {
-	struct commit_name *next;
+	struct hashmap_entry entry;
 	unsigned char peeled[20];
 	struct tag *tag;
 	unsigned prio:2; /* annotated tag = 2, tag = 1, head = 0 */
@@ -50,6 +50,11 @@ static const char *prio_names[] = {
 	"head", "lightweight", "annotated",
 };
 
+static int commit_name_cmp(struct commit_name *cn1, struct commit_name *cn2)
+{
+	return hashcmp(cn1->peeled, cn2->peeled);
+}
+
 static inline unsigned int hash_sha1(const unsigned char *sha1)
 {
 	unsigned int hash;
@@ -59,21 +64,10 @@ static inline unsigned int hash_sha1(const unsigned char *sha1)
 
 static inline struct commit_name *find_commit_name(const unsigned char *peeled)
 {
-	struct commit_name *n = lookup_hash(hash_sha1(peeled), &names);
-	while (n && !!hashcmp(peeled, n->peeled))
-		n = n->next;
-	return n;
-}
-
-static int set_util(void *chain, void *data)
-{
-	struct commit_name *n;
-	for (n = chain; n; n = n->next) {
-		struct commit *c = lookup_commit_reference_gently(n->peeled, 1);
-		if (c)
-			c->util = n;
-	}
-	return 0;
+	struct commit_name key;
+	hashmap_entry_init(&key, hash_sha1(peeled), 0);
+	hashcpy(key.peeled, peeled);
+	return hashmap_get(&names, &key);
 }
 
 static int replace_name(struct commit_name *e,
@@ -118,16 +112,10 @@ static void add_to_known_names(const char *path,
 	struct tag *tag = NULL;
 	if (replace_name(e, prio, sha1, &tag)) {
 		if (!e) {
-			void **pos;
 			e = xmalloc(sizeof(struct commit_name));
 			hashcpy(e->peeled, peeled);
-			pos = insert_hash(hash_sha1(peeled), e, &names);
-			if (pos) {
-				e->next = *pos;
-				*pos = e;
-			} else {
-				e->next = NULL;
-			}
+			hashmap_entry_init(e, hash_sha1(peeled), 0);
+			hashmap_add(&names, e);
 			e->path = NULL;
 		}
 		e->tag = tag;
@@ -292,7 +280,14 @@ static void describe(const char *arg, int last_one)
 		fprintf(stderr, _("searching to describe %s\n"), arg);
 
 	if (!have_util) {
-		for_each_hash(&names, set_util, NULL);
+		struct hashmap_iter iter;
+		struct commit *c;
+		struct commit_name *n = hashmap_iter_first(&names, &iter);
+		for (; n; n = hashmap_iter_next(&iter)) {
+			c = lookup_commit_reference_gently(n->peeled, 1);
+			if (c)
+				c->util = n;
+		}
 		have_util = 1;
 	}
 
@@ -463,9 +458,9 @@ int cmd_describe(int argc, const char **argv, const char *prefix)
 		return cmd_name_rev(args.argc, args.argv, prefix);
 	}
 
-	init_hash(&names);
+	hashmap_init(&names, (hashmap_cmp_fn) commit_name_cmp, 0);
 	for_each_rawref(get_name, NULL);
-	if (!names.nr && !always)
+	if (!names.size && !always)
 		die(_("No names found, cannot describe anything."));
 
 	if (argc == 0) {
-- 
1.8.4.5.gef01589.dirty

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 3/5] diffcore-rename.c: move code around to prepare for the next patch
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
  2013-09-24  9:51   ` [PATCH v2 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
  2013-09-24  9:52   ` [PATCH v2 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
@ 2013-09-24  9:52   ` Karsten Blees
  2013-09-24  9:53   ` [PATCH v2 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-24  9:52 UTC (permalink / raw
  To: Git List

No actual code changes, just move hash_filespec up and outdent part of
find_identical_files.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 diffcore-rename.c | 98 +++++++++++++++++++++++++++----------------------------
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6c7a72f..008a60c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -248,6 +248,18 @@ struct file_similarity {
 	struct file_similarity *next;
 };
 
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
 static int find_identical_files(struct file_similarity *src,
 				struct file_similarity *dst,
 				struct diff_options *options)
@@ -258,46 +270,46 @@ static int find_identical_files(struct file_similarity *src,
 	 * Walk over all the destinations ...
 	 */
 	do {
-		struct diff_filespec *target = dst->filespec;
-		struct file_similarity *p, *best;
-		int i = 100, best_score = -1;
-
-		/*
-		 * .. to find the best source match
-		 */
-		best = NULL;
-		for (p = src; p; p = p->next) {
-			int score;
-			struct diff_filespec *source = p->filespec;
-
-			/* False hash collision? */
-			if (hashcmp(source->sha1, target->sha1))
-				continue;
-			/* Non-regular files? If so, the modes must match! */
-			if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
-				if (source->mode != target->mode)
-					continue;
-			}
-			/* Give higher scores to sources that haven't been used already */
-			score = !source->rename_used;
-			if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
-				continue;
-			score += basename_same(source, target);
-			if (score > best_score) {
-				best = p;
-				best_score = score;
-				if (score == 2)
-					break;
-			}
+	struct diff_filespec *target = dst->filespec;
+	struct file_similarity *p, *best;
+	int i = 100, best_score = -1;
 
-			/* Too many identical alternatives? Pick one */
-			if (!--i)
-				break;
+	/*
+	 * .. to find the best source match
+	 */
+	best = NULL;
+	for (p = src; p; p = p->next) {
+		int score;
+		struct diff_filespec *source = p->filespec;
+
+		/* False hash collision? */
+		if (hashcmp(source->sha1, target->sha1))
+			continue;
+		/* Non-regular files? If so, the modes must match! */
+		if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
+			if (source->mode != target->mode)
+				continue;
 		}
-		if (best) {
-			record_rename_pair(dst->index, best->index, MAX_SCORE);
-			renames++;
+		/* Give higher scores to sources that haven't been used already */
+		score = !source->rename_used;
+		if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
+			continue;
+		score += basename_same(source, target);
+		if (score > best_score) {
+			best = p;
+			best_score = score;
+			if (score == 2)
+				break;
 		}
+
+		/* Too many identical alternatives? Pick one */
+		if (!--i)
+			break;
+	}
+	if (best) {
+		record_rename_pair(dst->index, best->index, MAX_SCORE);
+		renames++;
+	}
 	} while ((dst = dst->next) != NULL);
 	return renames;
 }
@@ -343,18 +355,6 @@ static int find_same_files(void *ptr, void *data)
 	return ret;
 }
 
-static unsigned int hash_filespec(struct diff_filespec *filespec)
-{
-	unsigned int hash;
-	if (!filespec->sha1_valid) {
-		if (diff_populate_filespec(filespec, 0))
-			return 0;
-		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
-	}
-	memcpy(&hash, filespec->sha1, sizeof(hash));
-	return hash;
-}
-
 static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
 {
 	void **pos;
-- 
1.8.4.5.gef01589.dirty

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 4/5] diffcore-rename.c: simplify finding exact renames
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
                     ` (2 preceding siblings ...)
  2013-09-24  9:52   ` [PATCH v2 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
@ 2013-09-24  9:53   ` Karsten Blees
  2013-09-24  9:54   ` [PATCH v2 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-24  9:53 UTC (permalink / raw
  To: Git List

The find_exact_renames function currently only uses the hash table for
grouping, i.e.:

1. add sources
2. add destinations
3. iterate all buckets, per bucket:
4. split sources from destinations
5. iterate destinations, per destination:
6. iterate sources to find best match

This can be simplified by utilizing the lookup functionality of the hash
table, i.e.:

1. add sources
2. iterate destinations, per destination:
3. lookup sources matching the current destination
4. iterate sources to find best match

This saves several iterations and file_similarity allocations for the
destinations.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 diffcore-rename.c | 75 +++++++++++++++----------------------------------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 008a60c..82b7975 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -243,7 +243,7 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
-	int src_dst, index;
+	int index;
 	struct diff_filespec *filespec;
 	struct file_similarity *next;
 };
@@ -260,25 +260,21 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(struct file_similarity *src,
-				struct file_similarity *dst,
+static int find_identical_files(struct hash_table *srcs,
+				int dst_index,
 				struct diff_options *options)
 {
 	int renames = 0;
 
-	/*
-	 * Walk over all the destinations ...
-	 */
-	do {
-	struct diff_filespec *target = dst->filespec;
+	struct diff_filespec *target = rename_dst[dst_index].two;
 	struct file_similarity *p, *best;
 	int i = 100, best_score = -1;
 
 	/*
-	 * .. to find the best source match
+	 * Find the best source match for specified destination.
 	 */
 	best = NULL;
-	for (p = src; p; p = p->next) {
+	for (p = lookup_hash(hash_filespec(target), srcs); p; p = p->next) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
@@ -307,61 +303,28 @@ static int find_identical_files(struct file_similarity *src,
 			break;
 	}
 	if (best) {
-		record_rename_pair(dst->index, best->index, MAX_SCORE);
+		record_rename_pair(dst_index, best->index, MAX_SCORE);
 		renames++;
 	}
-	} while ((dst = dst->next) != NULL);
 	return renames;
 }
 
-static void free_similarity_list(struct file_similarity *p)
+static int free_similarity_list(void *p, void *unused)
 {
 	while (p) {
 		struct file_similarity *entry = p;
-		p = p->next;
+		p = entry->next;
 		free(entry);
 	}
+	return 0;
 }
 
-static int find_same_files(void *ptr, void *data)
-{
-	int ret;
-	struct file_similarity *p = ptr;
-	struct file_similarity *src = NULL, *dst = NULL;
-	struct diff_options *options = data;
-
-	/* Split the hash list up into sources and destinations */
-	do {
-		struct file_similarity *entry = p;
-		p = p->next;
-		if (entry->src_dst < 0) {
-			entry->next = src;
-			src = entry;
-		} else {
-			entry->next = dst;
-			dst = entry;
-		}
-	} while (p);
-
-	/*
-	 * If we have both sources *and* destinations, see if
-	 * we can match them up
-	 */
-	ret = (src && dst) ? find_identical_files(src, dst, options) : 0;
-
-	/* Free the hashes and return the number of renames found */
-	free_similarity_list(src);
-	free_similarity_list(dst);
-	return ret;
-}
-
-static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+static void insert_file_table(struct hash_table *table, int index, struct diff_filespec *filespec)
 {
 	void **pos;
 	unsigned int hash;
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
-	entry->src_dst = src_dst;
 	entry->index = index;
 	entry->filespec = filespec;
 	entry->next = NULL;
@@ -385,24 +348,26 @@ static void insert_file_table(struct hash_table *table, int src_dst, int index,
  */
 static int find_exact_renames(struct diff_options *options)
 {
-	int i;
+	int i, renames;
 	struct hash_table file_table;
 
+	/* Add all sources to the hash table */
 	init_hash(&file_table);
-	preallocate_hash(&file_table, rename_src_nr + rename_dst_nr);
+	preallocate_hash(&file_table, rename_src_nr);
 	for (i = 0; i < rename_src_nr; i++)
-		insert_file_table(&file_table, -1, i, rename_src[i].p->one);
+		insert_file_table(&file_table, i, rename_src[i].p->one);
 
+	/* Walk the destinations and find best source match */
 	for (i = 0; i < rename_dst_nr; i++)
-		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+		renames += find_identical_files(&file_table, i, options);
 
-	/* Find the renames */
-	i = for_each_hash(&file_table, find_same_files, options);
+	/* Free source file_similarity chains */
+	for_each_hash(&file_table, free_similarity_list, options);
 
 	/* .. and free the hash data structure */
 	free_hash(&file_table);
 
-	return i;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
1.8.4.5.gef01589.dirty

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 5/5] diffcore-rename.c: use new hash map implementation
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
                     ` (3 preceding siblings ...)
  2013-09-24  9:53   ` [PATCH v2 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
@ 2013-09-24  9:54   ` Karsten Blees
  2013-09-24 10:18   ` [PATCH v2 0/5] New hash table implementation Fredrik Gustafsson
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-24  9:54 UTC (permalink / raw
  To: Git List

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 diffcore-rename.c | 48 +++++++++++++-----------------------------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 82b7975..2e70d31 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,7 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
-#include "hash.h"
+#include "hashmap.h"
 #include "progress.h"
 
 /* Table of rename/copy destinations */
@@ -243,9 +243,9 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
+	struct hashmap_entry entry;
 	int index;
 	struct diff_filespec *filespec;
-	struct file_similarity *next;
 };
 
 static unsigned int hash_filespec(struct diff_filespec *filespec)
@@ -260,21 +260,22 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(struct hash_table *srcs,
+static int find_identical_files(struct hashmap *srcs,
 				int dst_index,
 				struct diff_options *options)
 {
 	int renames = 0;
 
 	struct diff_filespec *target = rename_dst[dst_index].two;
-	struct file_similarity *p, *best;
+	struct file_similarity *p, *best, dst;
 	int i = 100, best_score = -1;
 
 	/*
 	 * Find the best source match for specified destination.
 	 */
 	best = NULL;
-	for (p = lookup_hash(hash_filespec(target), srcs); p; p = p->next) {
+	hashmap_entry_init(&dst, hash_filespec(target), 0);
+	for (p = hashmap_get(srcs, &dst); p; p = hashmap_get_next(srcs, p)) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
@@ -309,34 +310,15 @@ static int find_identical_files(struct hash_table *srcs,
 	return renames;
 }
 
-static int free_similarity_list(void *p, void *unused)
+static void insert_file_table(struct hashmap *table, int index, struct diff_filespec *filespec)
 {
-	while (p) {
-		struct file_similarity *entry = p;
-		p = entry->next;
-		free(entry);
-	}
-	return 0;
-}
-
-static void insert_file_table(struct hash_table *table, int index, struct diff_filespec *filespec)
-{
-	void **pos;
-	unsigned int hash;
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
 	entry->index = index;
 	entry->filespec = filespec;
-	entry->next = NULL;
-
-	hash = hash_filespec(filespec);
-	pos = insert_hash(hash, entry, table);
 
-	/* We already had an entry there? */
-	if (pos) {
-		entry->next = *pos;
-		*pos = entry;
-	}
+	hashmap_entry_init(entry, hash_filespec(filespec), 0);
+	hashmap_add(table, entry);
 }
 
 /*
@@ -349,11 +331,10 @@ static void insert_file_table(struct hash_table *table, int index, struct diff_f
 static int find_exact_renames(struct diff_options *options)
 {
 	int i, renames;
-	struct hash_table file_table;
+	struct hashmap file_table;
 
 	/* Add all sources to the hash table */
-	init_hash(&file_table);
-	preallocate_hash(&file_table, rename_src_nr);
+	hashmap_init(&file_table, NULL, rename_src_nr);
 	for (i = 0; i < rename_src_nr; i++)
 		insert_file_table(&file_table, i, rename_src[i].p->one);
 
@@ -361,11 +342,8 @@ static int find_exact_renames(struct diff_options *options)
 	for (i = 0; i < rename_dst_nr; i++)
 		renames += find_identical_files(&file_table, i, options);
 
-	/* Free source file_similarity chains */
-	for_each_hash(&file_table, free_similarity_list, options);
-
-	/* .. and free the hash data structure */
-	free_hash(&file_table);
+	/* Free the hash data structure and entries */
+	hashmap_free(&file_table, free);
 
 	return renames;
 }
-- 
1.8.4.5.gef01589.dirty

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
                     ` (4 preceding siblings ...)
  2013-09-24  9:54   ` [PATCH v2 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
@ 2013-09-24 10:18   ` Fredrik Gustafsson
  2013-09-24 11:16   ` Tay Ray Chuan
  2013-09-26 10:16   ` Fredrik Gustafsson
  7 siblings, 0 replies; 24+ messages in thread
From: Fredrik Gustafsson @ 2013-09-24 10:18 UTC (permalink / raw
  To: Karsten Blees; +Cc: Git List

On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
> Regarding performance, I have to admit that the difference between the two implementations is far greater than I had anticipated. The following times (in seconds) are from Linux x64 (Debian Sarge) on a Core i7 860 @2.8GHz. All tests have been run with 1,000 rounds of 100,000 entries each.
> 
> The 'get 10% hits' test does 100,000 lookups on a table with 10,000 entries (i.e. 90% unsuccessful lookups).
> 
> The rows denote different hash functions with different qualities:
> - FNV: FNV-1 hash on stringified loop counter (i.e. fnv1(itoa(i))), as
>   an example of a high quality / low collision hash
> - i: just the loop counter (i.e. 0, 1, 2,...), guaranteed collision free
> - i/10: every 10 entries share the same hash code, lots of collisions
> 
> The i and i/10 tests show that open addressing suffers badly from clustering, i.e. with adjacent hash codes, it degrades to linear search. The *2 versions provide for some space between used buckets to better compare it to the chaining version.
> 
> 
>         |       add        |  get 100% hits  |    get 10% hits
>         |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
> --------+--------+---------+-------+---------+---------+--------
> FNV     | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
> i       |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
> i    x2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
> i/10    |  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
> 
> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
> 

So I did this improved hash implementation a few months back. Although I
could do a test like this and see an improvement, I failed to see an
improvement in actual git usage.

Hopefully it was just me doing something wrong, but I abandonned the
idea of a better hashmap since I couldn't see any major performance
boost using git and the current implementation is really simple and easy
to maintain.

So my question to you is, does your hashmap speed up git? And does it
speed it up enough to justify that your implementation is the double
amount of code than the current?
-- 
Med vänliga hälsningar
Fredrik Gustafsson

tel: 0733-608274
e-post: iveqy@iveqy.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
                     ` (5 preceding siblings ...)
  2013-09-24 10:18   ` [PATCH v2 0/5] New hash table implementation Fredrik Gustafsson
@ 2013-09-24 11:16   ` Tay Ray Chuan
  2013-09-26 14:38     ` Karsten Blees
  2013-09-26 10:16   ` Fredrik Gustafsson
  7 siblings, 1 reply; 24+ messages in thread
From: Tay Ray Chuan @ 2013-09-24 11:16 UTC (permalink / raw
  To: Karsten Blees; +Cc: Git List

Hi Karsten,

On Tue, Sep 24, 2013 at 5:50 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>
>         |       add        |  get 100% hits  |    get 10% hits
>         |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
> --------+--------+---------+-------+---------+---------+--------
> FNV     | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
> i       |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
> i    x2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
> i/10    |  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
>
> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.

I'm not sure if I'm reading the numbers right, but they look impressive!

If it's not too much trouble, could you put together an API document,
along the lines of Documentation/technical/api-hash.txt? I could give
a stab at replacing patience and histogram diff's hash implementation
with yours.

-- 
Cheers,
Ray Chuan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
                     ` (6 preceding siblings ...)
  2013-09-24 11:16   ` Tay Ray Chuan
@ 2013-09-26 10:16   ` Fredrik Gustafsson
  2013-09-26 10:26     ` Duy Nguyen
  2013-09-26 13:55     ` Karsten Blees
  7 siblings, 2 replies; 24+ messages in thread
From: Fredrik Gustafsson @ 2013-09-26 10:16 UTC (permalink / raw
  To: Karsten Blees; +Cc: Git List

On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.

So I'm still curious about the actual performance improvements for git.
I runned git describe on the linux kernel with both the old hashmap and
this new one:

With old hashmap
================
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real    0m0.236s
user    0m0.216s
sys     0m0.020s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real    0m0.236s
user    0m0.220s
sys     0m0.016s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real    0m0.236s
user    0m0.212s
sys     0m0.024s

With new hashmap
================
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real    0m0.236s
user    0m0.216s
sys     0m0.020s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real    0m0.235s
user    0m0.216s
sys     0m0.020s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe
v3.12-rc2-83-g4b97280

real    0m0.235s
user    0m0.208s
sys     0m0.028s


I can't see any improvements at all here. What do I miss? Am I running
git describe in the wrong way? Does linux.git have too few tags to be
important?

-- 
Med vänliga hälsningar
Fredrik Gustafsson

tel: 0733-608274
e-post: iveqy@iveqy.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-26 10:16   ` Fredrik Gustafsson
@ 2013-09-26 10:26     ` Duy Nguyen
  2013-09-26 11:08       ` Fredrik Gustafsson
  2013-09-26 13:55     ` Karsten Blees
  1 sibling, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2013-09-26 10:26 UTC (permalink / raw
  To: Fredrik Gustafsson; +Cc: Karsten Blees, Git List

On Thu, Sep 26, 2013 at 5:16 PM, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
> On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
>> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
>
> So I'm still curious about the actual performance improvements for git.
> I runned git describe on the linux kernel with both the old hashmap and
> this new one:
>
> ...
>
> I can't see any improvements at all here. What do I miss? Am I running
> git describe in the wrong way? Does linux.git have too few tags to be
> important?

I wonder if it makes any difference if there are a lot more refs. I
hear gerrit creates a lot but don't know how many. linux-2.6 has ~350
refs. How about increasing the number of refs to 3500 refs?
-- 
Duy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-26 10:26     ` Duy Nguyen
@ 2013-09-26 11:08       ` Fredrik Gustafsson
  2013-09-26 11:14         ` Duy Nguyen
  0 siblings, 1 reply; 24+ messages in thread
From: Fredrik Gustafsson @ 2013-09-26 11:08 UTC (permalink / raw
  To: Duy Nguyen; +Cc: Fredrik Gustafsson, Karsten Blees, Git List

On Thu, Sep 26, 2013 at 05:26:27PM +0700, Duy Nguyen wrote:
> On Thu, Sep 26, 2013 at 5:16 PM, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
> > On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
> >> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
> >
> > So I'm still curious about the actual performance improvements for git.
> > I runned git describe on the linux kernel with both the old hashmap and
> > this new one:
> >
> > ...
> >
> > I can't see any improvements at all here. What do I miss? Am I running
> > git describe in the wrong way? Does linux.git have too few tags to be
> > important?
> 
> I wonder if it makes any difference if there are a lot more refs. I
> hear gerrit creates a lot but don't know how many. linux-2.6 has ~350
> refs. How about increasing the number of refs to 3500 refs?

So I runned:
for i in $(git rev-list HEAD ); do git tag "tag$i" $i ; done

in my linux repo and aborted it after a while:
iveqy@minilla:/srv/slask/linux$ git tag | wc -l
9323

So it's a few at least. Not sure how those artificial tagnames would
hurt or improve the performance.

Old hashtable
=============
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real    0m0.384s
user    0m0.288s
sys     0m0.092s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real    0m0.383s
user    0m0.284s
sys     0m0.100s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real    0m0.386s
user    0m0.312s
sys     0m0.072s


New hashtable
=============
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real    0m0.382s
user    0m0.300s
sys     0m0.084s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real    0m0.382s
user    0m0.288s
sys     0m0.092s
iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD 
v3.12-rc2-83-g4b97280

real    0m0.384s
user    0m0.296s
sys     0m0.088s


-- 
Med vänliga hälsningar
Fredrik Gustafsson

tel: 0733-608274
e-post: iveqy@iveqy.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-26 11:08       ` Fredrik Gustafsson
@ 2013-09-26 11:14         ` Duy Nguyen
  0 siblings, 0 replies; 24+ messages in thread
From: Duy Nguyen @ 2013-09-26 11:14 UTC (permalink / raw
  To: Fredrik Gustafsson; +Cc: Karsten Blees, Git List

On Thu, Sep 26, 2013 at 6:08 PM, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
> On Thu, Sep 26, 2013 at 05:26:27PM +0700, Duy Nguyen wrote:
>> On Thu, Sep 26, 2013 at 5:16 PM, Fredrik Gustafsson <iveqy@iveqy.com> wrote:
>> > On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
>> >> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
>> >
>> > So I'm still curious about the actual performance improvements for git.
>> > I runned git describe on the linux kernel with both the old hashmap and
>> > this new one:
>> >
>> > ...
>> >
>> > I can't see any improvements at all here. What do I miss? Am I running
>> > git describe in the wrong way? Does linux.git have too few tags to be
>> > important?
>>
>> I wonder if it makes any difference if there are a lot more refs. I
>> hear gerrit creates a lot but don't know how many. linux-2.6 has ~350
>> refs. How about increasing the number of refs to 3500 refs?
>
> So I runned:
> for i in $(git rev-list HEAD ); do git tag "tag$i" $i ; done
>
> in my linux repo and aborted it after a while:
> iveqy@minilla:/srv/slask/linux$ git tag | wc -l
> 9323
>
> So it's a few at least. Not sure how those artificial tagnames would
> hurt or improve the performance.
>
> Old hashtable
> =============
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real    0m0.384s
> user    0m0.288s
> sys     0m0.092s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real    0m0.383s
> user    0m0.284s
> sys     0m0.100s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real    0m0.386s
> user    0m0.312s
> sys     0m0.072s
>
>
> New hashtable
> =============
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real    0m0.382s
> user    0m0.300s
> sys     0m0.084s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real    0m0.382s
> user    0m0.288s
> sys     0m0.092s
> iveqy@minilla:/srv/slask/linux$ time ../git/git describe HEAD
> v3.12-rc2-83-g4b97280
>
> real    0m0.384s
> user    0m0.296s
> sys     0m0.088s

OK I have to say I don't see any justification for more code then.
-- 
Duy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-26 10:16   ` Fredrik Gustafsson
  2013-09-26 10:26     ` Duy Nguyen
@ 2013-09-26 13:55     ` Karsten Blees
  1 sibling, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-26 13:55 UTC (permalink / raw
  To: Fredrik Gustafsson; +Cc: Git List

Am 26.09.2013 12:16, schrieb Fredrik Gustafsson:
> On Tue, Sep 24, 2013 at 11:50:16AM +0200, Karsten Blees wrote:
>> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
> 
> So I'm still curious about the actual performance improvements for git.
> I runned git describe on the linux kernel with both the old hashmap and
> this new one:
> 

Performance was never the primary issue, the intention of the performance tests was to ensure that the new implementation doesn't *slow down* git.

>From the original PATCH/RFC:
- O(1) remove
- builtin entry chaining
- ready-to-use FNV-1 hash functions
- unit test
- additions are ~twice as fast
- uses less memory

So, the new implementation allows us to get rid of workarounds such as the CE_UNHASHED flag, duplicate entry chaining code and hash_name() implementations. It also addresses the memory usage FIXME in hash.h.

The simplified API may help prevent bugs such as the broken entry chaining in name-hash.c (see commits 2548183, 395c735, 2092678).

Maybe we can also replace some of the custom hash table implementations in attr.c, decorate.c, fast-import.c and object.c (to name just a few...).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/5] New hash table implementation
  2013-09-24 11:16   ` Tay Ray Chuan
@ 2013-09-26 14:38     ` Karsten Blees
  0 siblings, 0 replies; 24+ messages in thread
From: Karsten Blees @ 2013-09-26 14:38 UTC (permalink / raw
  To: Tay Ray Chuan; +Cc: Git List

Am 24.09.2013 13:16, schrieb Tay Ray Chuan:
> Hi Karsten,
> 
> On Tue, Sep 24, 2013 at 5:50 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>>
>>         |       add        |  get 100% hits  |    get 10% hits
>>         |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
>> --------+--------+---------+-------+---------+---------+--------
>> FNV     | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
>> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
>> i       |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
>> i    x2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
>> i/10    |  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
>> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
>>
>> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
> 
> I'm not sure if I'm reading the numbers right, but they look impressive!
> 

The numbers are for 100 million additions / lookups (1,000 rounds á 100,000 entries). Considering everything else that happens in git, the hash table performance should be insignificant, though.

> If it's not too much trouble, could you put together an API document,
> along the lines of Documentation/technical/api-hash.txt?

Yes, I had already planned to port the documentation to asciidoc. Although in my experience, API documentation in the header file tends to better stay in sync with code changes (but this only makes real sense with extraction tools such as doxygen).

> I could give
> a stab at replacing patience and histogram diff's hash implementation
> with yours.
> 

Open addressing (i.e. distributing conflicting entries to other buckes) *may* be faster *if* all data fits into the table (i.e. no pointers to the data are used). Scanning such a table (without following pointers) has very high locality and thus may benefit from accessing fewer CPU cache lines. The patience implementation seems to fall into this category (although the entry struct is fairly large, and it also uses the *2 trick to defeat bad hash codes (which wouldn't be necessary with chaining)).

Both patience and histogram use preallocated, fixed-size hash tables, and thus won't benefit from faster inserts (the 'add' performance numbers are for dynamically resized hash tables).

So, converting patience/histogram is probably not worth the trouble for performance reasons alone. If it also simplifies the algorithms and/or reduces memory usage - fine.

Ciao,
Karsten

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-09-26 14:38 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-10 23:27 [PATCH/RFC 0/5] New hash table implementation Karsten Blees
2013-09-10 23:28 ` [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-09-11 23:56   ` Junio C Hamano
2013-09-23  9:16     ` Karsten Blees
2013-09-12  4:10   ` Junio C Hamano
2013-09-23  9:21     ` Karsten Blees
2013-09-10 23:28 ` [PATCH/RFC 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
2013-09-10 23:29 ` [PATCH/RFC 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-09-10 23:30 ` [PATCH/RFC 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-09-10 23:30 ` [PATCH/RFC 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-09-24  9:50 ` [PATCH v2 0/5] New hash table implementation Karsten Blees
2013-09-24  9:51   ` [PATCH v2 1/5] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-09-24  9:52   ` [PATCH v2 2/5] buitin/describe.c: use new hash map implementation Karsten Blees
2013-09-24  9:52   ` [PATCH v2 3/5] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-09-24  9:53   ` [PATCH v2 4/5] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-09-24  9:54   ` [PATCH v2 5/5] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-09-24 10:18   ` [PATCH v2 0/5] New hash table implementation Fredrik Gustafsson
2013-09-24 11:16   ` Tay Ray Chuan
2013-09-26 14:38     ` Karsten Blees
2013-09-26 10:16   ` Fredrik Gustafsson
2013-09-26 10:26     ` Duy Nguyen
2013-09-26 11:08       ` Fredrik Gustafsson
2013-09-26 11:14         ` Duy Nguyen
2013-09-26 13:55     ` Karsten Blees

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).