git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] oidmap: map with OID as key
@ 2017-09-27 22:19 Jonathan Tan
  2017-09-28  0:41 ` Brandon Williams
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Jonathan Tan @ 2017-09-27 22:19 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, git

This is similar to using the hashmap in hashmap.c, but with an
easier-to-use API. In particular, custom entry comparisons no longer
need to be written, and lookups can be done without constructing a
temporary entry structure.

oidset has been updated to use oidmap.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I have been wishing for an OID map for a while, and it seems to me that
Jeff Hostetler's recent patches [1] could use one too, so here it is.

(Also, I wonder if some details such as grow_at, shrink_at, and
do_count_items can be removed, since the applications of this are
narrower than that of hashmap.)

[1] https://public-inbox.org/git/20170922202632.53714-1-git@jeffhostetler.com/
---
 Makefile |   1 +
 oidmap.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 oidmap.h |  98 ++++++++++++++++++++++++++++++++++++++++
 oidset.c |  38 ++++------------
 oidset.h |   4 +-
 5 files changed, 263 insertions(+), 30 deletions(-)
 create mode 100644 oidmap.c
 create mode 100644 oidmap.h

diff --git a/Makefile b/Makefile
index ed4ca438b..64136dde4 100644
--- a/Makefile
+++ b/Makefile
@@ -821,6 +821,7 @@ LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
+LIB_OBJS += oidmap.o
 LIB_OBJS += oidset.o
 LIB_OBJS += packfile.o
 LIB_OBJS += pack-bitmap.o
diff --git a/oidmap.c b/oidmap.c
new file mode 100644
index 000000000..40e696cee
--- /dev/null
+++ b/oidmap.c
@@ -0,0 +1,152 @@
+#include "cache.h"
+#include "oidmap.h"
+
+#define OIDMAP_INITIAL_SIZE 64
+/* grow / shrink by 2^2 */
+#define OIDMAP_RESIZE_BITS 2
+/* load factor in percent */
+#define OIDMAP_LOAD_FACTOR 80
+
+static void alloc_table(struct oidmap *map, unsigned int size)
+{
+	map->tablesize = size;
+	map->table = xcalloc(size, sizeof(struct oidmap_entry *));
+
+	/* calculate resize thresholds for new size */
+	map->grow_at = (unsigned int) ((uint64_t) size * OIDMAP_LOAD_FACTOR / 100);
+	if (size <= OIDMAP_INITIAL_SIZE)
+		map->shrink_at = 0;
+	else
+		/*
+		 * The shrink-threshold must be slightly smaller than
+		 * (grow-threshold / resize-factor) to prevent erratic resizing,
+		 * thus we divide by (resize-factor + 1).
+		 */
+		map->shrink_at = map->grow_at / ((1 << OIDMAP_RESIZE_BITS) + 1);
+}
+
+static inline unsigned int bucket(const struct oidmap *map,
+				  const struct object_id *key)
+{
+	unsigned int hash;
+	memcpy(&hash, key->hash, sizeof(hash));
+	return hash & (map->tablesize - 1);
+}
+
+static void rehash(struct oidmap *map, unsigned int newsize)
+{
+	unsigned int i, oldsize = map->tablesize;
+	struct oidmap_entry **oldtable = map->table;
+
+	alloc_table(map, newsize);
+	for (i = 0; i < oldsize; i++) {
+		struct oidmap_entry *e = oldtable[i];
+		while (e) {
+			struct oidmap_entry *next = e->next;
+			unsigned int b = bucket(map, &e->oid);
+			e->next = map->table[b];
+			map->table[b] = e;
+			e = next;
+		}
+	}
+	free(oldtable);
+}
+
+static inline struct oidmap_entry **find_entry_ptr(const struct oidmap *map,
+						   const struct object_id *key)
+{
+	struct oidmap_entry **e = &map->table[bucket(map, key)];
+	while (*e && oidcmp(&(*e)->oid, key))
+		e = &(*e)->next;
+	return e;
+}
+
+void oidmap_init(struct oidmap *map, size_t initial_size)
+{
+	unsigned int size = OIDMAP_INITIAL_SIZE;
+
+	memset(map, 0, sizeof(*map));
+
+	/* calculate initial table size and allocate the table */
+	initial_size = (unsigned int) ((uint64_t) initial_size * 100
+			/ OIDMAP_LOAD_FACTOR);
+	while (initial_size > size)
+		size <<= OIDMAP_RESIZE_BITS;
+	alloc_table(map, size);
+
+	/*
+	 * Keep track of the number of items in the map and
+	 * allow the map to automatically grow as necessary.
+	 */
+	map->do_count_items = 1;
+}
+
+void oidmap_free(struct oidmap *map, int free_entries)
+{
+	if (!map || !map->table)
+		return;
+	if (free_entries) {
+		int i;
+		for (i = 0; i < map->tablesize; i++) {
+			struct oidmap_entry *e = map->table[i];
+			while (e) {
+				struct oidmap_entry *next = e->next;
+				free(e);
+				e = next;
+			}
+		}
+	}
+	free(map->table);
+	memset(map, 0, sizeof(*map));
+}
+
+void *oidmap_get(const struct oidmap *map, const struct object_id *key)
+{
+	return *find_entry_ptr(map, key);
+}
+
+static void oidmap_add(struct oidmap *map, struct oidmap_entry *entry)
+{
+	unsigned int b = bucket(map, &entry->oid);
+
+	/* add entry */
+	entry->next = map->table[b];
+	map->table[b] = entry;
+
+	/* fix size and rehash if appropriate */
+	if (map->do_count_items) {
+		map->private_size++;
+		if (map->private_size > map->grow_at)
+			rehash(map, map->tablesize << OIDMAP_RESIZE_BITS);
+	}
+}
+
+void *oidmap_remove(struct oidmap *map, const struct object_id *key)
+{
+	struct oidmap_entry *old;
+	struct oidmap_entry **e = find_entry_ptr(map, key);
+	if (!*e)
+		return NULL;
+
+	/* remove existing entry */
+	old = *e;
+	*e = old->next;
+	old->next = NULL;
+
+	/* fix size and rehash if appropriate */
+	if (map->do_count_items) {
+		map->private_size--;
+		if (map->private_size < map->shrink_at)
+			rehash(map, map->tablesize >> OIDMAP_RESIZE_BITS);
+	}
+
+	return old;
+}
+
+void *oidmap_put(struct oidmap *map, void *entry)
+{
+	struct oidmap_entry *to_put = entry;
+	struct oidmap_entry *old = oidmap_remove(map, &to_put->oid);
+	oidmap_add(map, to_put);
+	return old;
+}
diff --git a/oidmap.h b/oidmap.h
new file mode 100644
index 000000000..a543ed828
--- /dev/null
+++ b/oidmap.h
@@ -0,0 +1,98 @@
+#ifndef OIDMAP_H
+#define OIDMAP_H
+
+/*
+ * struct oidmap_entry is a structure representing an entry in the hash table,
+ * which must be used as first member of user data structures. It must be
+ * zero-initialized (or use OIDMAP_ENTRY_INIT).
+ */
+struct oidmap_entry {
+	/*
+	 * next points to the next entry in case of collisions (i.e. if
+	 * multiple entries map to the same bucket). For oidmap's internal use
+	 * only.
+	 */
+	struct oidmap_entry *next;
+
+	struct object_id oid;
+};
+#define OIDMAP_ENTRY_INIT { NULL }
+
+/*
+ * struct oidmap is the hash table structure. Members can be used as follows,
+ * but should not be modified directly.
+ */
+struct oidmap {
+	struct oidmap_entry **table;
+
+	/* total number of entries (0 means the hashmap is empty) */
+	unsigned int private_size; /* use oidmap_get_size() */
+
+	/*
+	 * tablesize is the allocated size of the hash table. A non-0 value
+	 * indicates that the hashmap is initialized. It may also be useful
+	 * for statistical purposes (i.e. `size / tablesize` is the current
+	 * load factor).
+	 */
+	unsigned int tablesize;
+
+	unsigned int grow_at;
+	unsigned int shrink_at;
+
+	unsigned int do_count_items : 1;
+};
+
+/* oidmap functions */
+
+/*
+ * Initializes an oidmap structure.
+ *
+ * `map` is the oidmap to initialize.
+ *
+ * If the total number of entries is known in advance, the `initial_size`
+ * parameter may be used to preallocate a sufficiently large table and thus
+ * prevent expensive resizing. If 0, the table is dynamically resized.
+ */
+extern void oidmap_init(struct oidmap *map, size_t initial_size);
+
+/*
+ * Frees an oidmap structure and allocated memory.
+ *
+ * If `free_entries` is true, each oidmap_entry in the map is freed as well
+ * using stdlibs free().
+ */
+extern void oidmap_free(struct oidmap *map, int free_entries);
+
+/*
+ * Return the number of items in the map.
+ */
+static inline unsigned int oidmap_get_size(struct oidmap *map)
+{
+	if (map->do_count_items)
+		return map->private_size;
+
+	BUG("oidmap_get_size: size not set");
+	return 0;
+}
+
+/*
+ * Returns the oidmap entry for the specified oid, or NULL if not found.
+ */
+extern void *oidmap_get(const struct oidmap *map,
+			const struct object_id *key);
+
+/*
+ * Adds or replaces an oidmap entry.
+ *
+ * Returns the replaced entry, or NULL if not found (i.e. the entry was added).
+ */
+extern void *oidmap_put(struct oidmap *map, void *entry);
+
+/*
+ * Removes an oidmap entry matching the specified oid.
+ *
+ * Returns the removed entry, or NULL if not found.
+ */
+extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
+
+#endif
diff --git a/oidset.c b/oidset.c
index a6a08ba52..6fe7036c4 100644
--- a/oidset.c
+++ b/oidset.c
@@ -1,50 +1,30 @@
 #include "cache.h"
 #include "oidset.h"
 
-struct oidset_entry {
-	struct hashmap_entry hash;
-	struct object_id oid;
-};
-
-static int oidset_hashcmp(const void *unused_cmp_data,
-			  const void *va, const void *vb,
-			  const void *vkey)
-{
-	const struct oidset_entry *a = va, *b = vb;
-	const struct object_id *key = vkey;
-	return oidcmp(&a->oid, key ? key : &b->oid);
-}
-
 int oidset_contains(const struct oidset *set, const struct object_id *oid)
 {
-	struct hashmap_entry key;
-
-	if (!set->map.cmpfn)
+	if (!set->map.tablesize)
 		return 0;
-
-	hashmap_entry_init(&key, sha1hash(oid->hash));
-	return !!hashmap_get(&set->map, &key, oid);
+	return !!oidmap_get(&set->map, oid);
 }
 
 int oidset_insert(struct oidset *set, const struct object_id *oid)
 {
-	struct oidset_entry *entry;
-
-	if (!set->map.cmpfn)
-		hashmap_init(&set->map, oidset_hashcmp, NULL, 0);
+	struct oidmap_entry *entry;
 
-	if (oidset_contains(set, oid))
+	if (!set->map.tablesize)
+		oidmap_init(&set->map, 0);
+	else if (oidset_contains(set, oid))
 		return 1;
 
-	entry = xmalloc(sizeof(*entry));
-	hashmap_entry_init(&entry->hash, sha1hash(oid->hash));
+	entry = xcalloc(1, sizeof(*entry));
 	oidcpy(&entry->oid, oid);
 
-	hashmap_add(&set->map, entry);
+	oidmap_put(&set->map, entry);
 	return 0;
 }
 
 void oidset_clear(struct oidset *set)
 {
-	hashmap_free(&set->map, 1);
+	oidmap_free(&set->map, 1);
 }
diff --git a/oidset.h b/oidset.h
index b7eaab5b8..b1ec82bfc 100644
--- a/oidset.h
+++ b/oidset.h
@@ -1,6 +1,8 @@
 #ifndef OIDSET_H
 #define OIDSET_H
 
+#include "oidmap.h"
+
 /**
  * This API is similar to sha1-array, in that it maintains a set of object ids
  * in a memory-efficient way. The major differences are:
@@ -17,7 +19,7 @@
  * A single oidset; should be zero-initialized (or use OIDSET_INIT).
  */
 struct oidset {
-	struct hashmap map;
+	struct oidmap map;
 };
 
 #define OIDSET_INIT { { NULL } }
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-27 22:19 [PATCH] oidmap: map with OID as key Jonathan Tan
@ 2017-09-28  0:41 ` Brandon Williams
  2017-09-28 17:46   ` Jonathan Tan
  2017-09-28  3:13 ` Junio C Hamano
  2017-09-29 22:54 ` [PATCH v2] " Jonathan Tan
  2 siblings, 1 reply; 16+ messages in thread
From: Brandon Williams @ 2017-09-28  0:41 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, git

On 09/27, Jonathan Tan wrote:
> This is similar to using the hashmap in hashmap.c, but with an
> easier-to-use API. In particular, custom entry comparisons no longer
> need to be written, and lookups can be done without constructing a
> temporary entry structure.
> 
> oidset has been updated to use oidmap.

After a quick glance at the code, I think it looks good.  I do have one
suggestion though.  This map is structured much like our internal
hashmap where if you want to include any data you have to implement your
own struct with the internal 'entry' struct as the first member in your
custom struct.  I personal dislike this method, despite the potential
memory savings, because it makes the data structure much more difficult
to use.

If all i wanted was a map from 'OID -> const char *' then I would need
to write all the boilerplate code to make a struct which contains the
map 'entry' struct + the 'const char *' entry.  You are also making the
caller responsible for allocating the individual entries instead of
letting the data structure take care of that internally.

To me it seems like a much simpler API for a map would be to just allow
callers to store a 'void *' as the value.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-27 22:19 [PATCH] oidmap: map with OID as key Jonathan Tan
  2017-09-28  0:41 ` Brandon Williams
@ 2017-09-28  3:13 ` Junio C Hamano
  2017-09-28 17:38   ` Jonathan Tan
  2017-09-29 22:54 ` [PATCH v2] " Jonathan Tan
  2 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2017-09-28  3:13 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, git

Jonathan Tan <jonathantanmy@google.com> writes:

> This is similar to using the hashmap in hashmap.c, but with an
> easier-to-use API. In particular, custom entry comparisons no longer
> need to be written, and lookups can be done without constructing a
> temporary entry structure.

A naïve question is why this needs to duplicate so much code, just
to build something similar in spirit to hashmap but unlike hashmap
that can take caller-defined keys, limited to using oid as the keys,
instead of just being a thin API wrapper that uses hashmap as its
internal implementation detail.  

Is the way hashmap API is structured so hard to use it in such a
way, or something?

>  Makefile |   1 +
>  oidmap.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  oidmap.h |  98 ++++++++++++++++++++++++++++++++++++++++
>  oidset.c |  38 ++++------------
>  oidset.h |   4 +-
>  5 files changed, 263 insertions(+), 30 deletions(-)
>  create mode 100644 oidmap.c
>  create mode 100644 oidmap.h
>
> diff --git a/Makefile b/Makefile
> index ed4ca438b..64136dde4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -821,6 +821,7 @@ LIB_OBJS += notes-cache.o
>  LIB_OBJS += notes-merge.o
>  LIB_OBJS += notes-utils.o
>  LIB_OBJS += object.o
> +LIB_OBJS += oidmap.o
>  LIB_OBJS += oidset.o
>  LIB_OBJS += packfile.o
>  LIB_OBJS += pack-bitmap.o
> diff --git a/oidmap.c b/oidmap.c
> new file mode 100644
> index 000000000..40e696cee
> --- /dev/null
> +++ b/oidmap.c
> @@ -0,0 +1,152 @@
> +#include "cache.h"
> +#include "oidmap.h"
> +
> +#define OIDMAP_INITIAL_SIZE 64
> +/* grow / shrink by 2^2 */
> +#define OIDMAP_RESIZE_BITS 2
> +/* load factor in percent */
> +#define OIDMAP_LOAD_FACTOR 80
> +
> +static void alloc_table(struct oidmap *map, unsigned int size)
> +{
> +	map->tablesize = size;
> +	map->table = xcalloc(size, sizeof(struct oidmap_entry *));
> +
> +	/* calculate resize thresholds for new size */
> +	map->grow_at = (unsigned int) ((uint64_t) size * OIDMAP_LOAD_FACTOR / 100);
> +	if (size <= OIDMAP_INITIAL_SIZE)
> +		map->shrink_at = 0;
> +	else
> +		/*
> +		 * The shrink-threshold must be slightly smaller than
> +		 * (grow-threshold / resize-factor) to prevent erratic resizing,
> +		 * thus we divide by (resize-factor + 1).
> +		 */
> +		map->shrink_at = map->grow_at / ((1 << OIDMAP_RESIZE_BITS) + 1);
> +}
> +
> +static inline unsigned int bucket(const struct oidmap *map,
> +				  const struct object_id *key)
> +{
> +	unsigned int hash;
> +	memcpy(&hash, key->hash, sizeof(hash));
> +	return hash & (map->tablesize - 1);
> +}
> +
> +static void rehash(struct oidmap *map, unsigned int newsize)
> +{
> +	unsigned int i, oldsize = map->tablesize;
> +	struct oidmap_entry **oldtable = map->table;
> +
> +	alloc_table(map, newsize);
> +	for (i = 0; i < oldsize; i++) {
> +		struct oidmap_entry *e = oldtable[i];
> +		while (e) {
> +			struct oidmap_entry *next = e->next;
> +			unsigned int b = bucket(map, &e->oid);
> +			e->next = map->table[b];
> +			map->table[b] = e;
> +			e = next;
> +		}
> +	}
> +	free(oldtable);
> +}
> +
> +static inline struct oidmap_entry **find_entry_ptr(const struct oidmap *map,
> +						   const struct object_id *key)
> +{
> +	struct oidmap_entry **e = &map->table[bucket(map, key)];
> +	while (*e && oidcmp(&(*e)->oid, key))
> +		e = &(*e)->next;
> +	return e;
> +}
> +
> +void oidmap_init(struct oidmap *map, size_t initial_size)
> +{
> +	unsigned int size = OIDMAP_INITIAL_SIZE;
> +
> +	memset(map, 0, sizeof(*map));
> +
> +	/* calculate initial table size and allocate the table */
> +	initial_size = (unsigned int) ((uint64_t) initial_size * 100
> +			/ OIDMAP_LOAD_FACTOR);
> +	while (initial_size > size)
> +		size <<= OIDMAP_RESIZE_BITS;
> +	alloc_table(map, size);
> +
> +	/*
> +	 * Keep track of the number of items in the map and
> +	 * allow the map to automatically grow as necessary.
> +	 */
> +	map->do_count_items = 1;
> +}
> +
> +void oidmap_free(struct oidmap *map, int free_entries)
> +{
> +	if (!map || !map->table)
> +		return;
> +	if (free_entries) {
> +		int i;
> +		for (i = 0; i < map->tablesize; i++) {
> +			struct oidmap_entry *e = map->table[i];
> +			while (e) {
> +				struct oidmap_entry *next = e->next;
> +				free(e);
> +				e = next;
> +			}
> +		}
> +	}
> +	free(map->table);
> +	memset(map, 0, sizeof(*map));
> +}
> +
> +void *oidmap_get(const struct oidmap *map, const struct object_id *key)
> +{
> +	return *find_entry_ptr(map, key);
> +}
> +
> +static void oidmap_add(struct oidmap *map, struct oidmap_entry *entry)
> +{
> +	unsigned int b = bucket(map, &entry->oid);
> +
> +	/* add entry */
> +	entry->next = map->table[b];
> +	map->table[b] = entry;
> +
> +	/* fix size and rehash if appropriate */
> +	if (map->do_count_items) {
> +		map->private_size++;
> +		if (map->private_size > map->grow_at)
> +			rehash(map, map->tablesize << OIDMAP_RESIZE_BITS);
> +	}
> +}
> +
> +void *oidmap_remove(struct oidmap *map, const struct object_id *key)
> +{
> +	struct oidmap_entry *old;
> +	struct oidmap_entry **e = find_entry_ptr(map, key);
> +	if (!*e)
> +		return NULL;
> +
> +	/* remove existing entry */
> +	old = *e;
> +	*e = old->next;
> +	old->next = NULL;
> +
> +	/* fix size and rehash if appropriate */
> +	if (map->do_count_items) {
> +		map->private_size--;
> +		if (map->private_size < map->shrink_at)
> +			rehash(map, map->tablesize >> OIDMAP_RESIZE_BITS);
> +	}
> +
> +	return old;
> +}
> +
> +void *oidmap_put(struct oidmap *map, void *entry)
> +{
> +	struct oidmap_entry *to_put = entry;
> +	struct oidmap_entry *old = oidmap_remove(map, &to_put->oid);
> +	oidmap_add(map, to_put);
> +	return old;
> +}
> diff --git a/oidmap.h b/oidmap.h
> new file mode 100644
> index 000000000..a543ed828
> --- /dev/null
> +++ b/oidmap.h
> @@ -0,0 +1,98 @@
> +#ifndef OIDMAP_H
> +#define OIDMAP_H
> +
> +/*
> + * struct oidmap_entry is a structure representing an entry in the hash table,
> + * which must be used as first member of user data structures. It must be
> + * zero-initialized (or use OIDMAP_ENTRY_INIT).
> + */
> +struct oidmap_entry {
> +	/*
> +	 * next points to the next entry in case of collisions (i.e. if
> +	 * multiple entries map to the same bucket). For oidmap's internal use
> +	 * only.
> +	 */
> +	struct oidmap_entry *next;
> +
> +	struct object_id oid;
> +};
> +#define OIDMAP_ENTRY_INIT { NULL }
> +
> +/*
> + * struct oidmap is the hash table structure. Members can be used as follows,
> + * but should not be modified directly.
> + */
> +struct oidmap {
> +	struct oidmap_entry **table;
> +
> +	/* total number of entries (0 means the hashmap is empty) */
> +	unsigned int private_size; /* use oidmap_get_size() */
> +
> +	/*
> +	 * tablesize is the allocated size of the hash table. A non-0 value
> +	 * indicates that the hashmap is initialized. It may also be useful
> +	 * for statistical purposes (i.e. `size / tablesize` is the current
> +	 * load factor).
> +	 */
> +	unsigned int tablesize;
> +
> +	unsigned int grow_at;
> +	unsigned int shrink_at;
> +
> +	unsigned int do_count_items : 1;
> +};
> +
> +/* oidmap functions */
> +
> +/*
> + * Initializes an oidmap structure.
> + *
> + * `map` is the oidmap to initialize.
> + *
> + * If the total number of entries is known in advance, the `initial_size`
> + * parameter may be used to preallocate a sufficiently large table and thus
> + * prevent expensive resizing. If 0, the table is dynamically resized.
> + */
> +extern void oidmap_init(struct oidmap *map, size_t initial_size);
> +
> +/*
> + * Frees an oidmap structure and allocated memory.
> + *
> + * If `free_entries` is true, each oidmap_entry in the map is freed as well
> + * using stdlibs free().
> + */
> +extern void oidmap_free(struct oidmap *map, int free_entries);
> +
> +/*
> + * Return the number of items in the map.
> + */
> +static inline unsigned int oidmap_get_size(struct oidmap *map)
> +{
> +	if (map->do_count_items)
> +		return map->private_size;
> +
> +	BUG("oidmap_get_size: size not set");
> +	return 0;
> +}
> +
> +/*
> + * Returns the oidmap entry for the specified oid, or NULL if not found.
> + */
> +extern void *oidmap_get(const struct oidmap *map,
> +			const struct object_id *key);
> +
> +/*
> + * Adds or replaces an oidmap entry.
> + *
> + * Returns the replaced entry, or NULL if not found (i.e. the entry was added).
> + */
> +extern void *oidmap_put(struct oidmap *map, void *entry);
> +
> +/*
> + * Removes an oidmap entry matching the specified oid.
> + *
> + * Returns the removed entry, or NULL if not found.
> + */
> +extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
> +
> +#endif
> diff --git a/oidset.c b/oidset.c
> index a6a08ba52..6fe7036c4 100644
> --- a/oidset.c
> +++ b/oidset.c
> @@ -1,50 +1,30 @@
>  #include "cache.h"
>  #include "oidset.h"
>  
> -struct oidset_entry {
> -	struct hashmap_entry hash;
> -	struct object_id oid;
> -};
> -
> -static int oidset_hashcmp(const void *unused_cmp_data,
> -			  const void *va, const void *vb,
> -			  const void *vkey)
> -{
> -	const struct oidset_entry *a = va, *b = vb;
> -	const struct object_id *key = vkey;
> -	return oidcmp(&a->oid, key ? key : &b->oid);
> -}
> -
>  int oidset_contains(const struct oidset *set, const struct object_id *oid)
>  {
> -	struct hashmap_entry key;
> -
> -	if (!set->map.cmpfn)
> +	if (!set->map.tablesize)
>  		return 0;
> -
> -	hashmap_entry_init(&key, sha1hash(oid->hash));
> -	return !!hashmap_get(&set->map, &key, oid);
> +	return !!oidmap_get(&set->map, oid);
>  }
>  
>  int oidset_insert(struct oidset *set, const struct object_id *oid)
>  {
> -	struct oidset_entry *entry;
> -
> -	if (!set->map.cmpfn)
> -		hashmap_init(&set->map, oidset_hashcmp, NULL, 0);
> +	struct oidmap_entry *entry;
>  
> -	if (oidset_contains(set, oid))
> +	if (!set->map.tablesize)
> +		oidmap_init(&set->map, 0);
> +	else if (oidset_contains(set, oid))
>  		return 1;
>  
> -	entry = xmalloc(sizeof(*entry));
> -	hashmap_entry_init(&entry->hash, sha1hash(oid->hash));
> +	entry = xcalloc(1, sizeof(*entry));
>  	oidcpy(&entry->oid, oid);
>  
> -	hashmap_add(&set->map, entry);
> +	oidmap_put(&set->map, entry);
>  	return 0;
>  }
>  
>  void oidset_clear(struct oidset *set)
>  {
> -	hashmap_free(&set->map, 1);
> +	oidmap_free(&set->map, 1);
>  }
> diff --git a/oidset.h b/oidset.h
> index b7eaab5b8..b1ec82bfc 100644
> --- a/oidset.h
> +++ b/oidset.h
> @@ -1,6 +1,8 @@
>  #ifndef OIDSET_H
>  #define OIDSET_H
>  
> +#include "oidmap.h"
> +
>  /**
>   * This API is similar to sha1-array, in that it maintains a set of object ids
>   * in a memory-efficient way. The major differences are:
> @@ -17,7 +19,7 @@
>   * A single oidset; should be zero-initialized (or use OIDSET_INIT).
>   */
>  struct oidset {
> -	struct hashmap map;
> +	struct oidmap map;
>  };
>  
>  #define OIDSET_INIT { { NULL } }

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-28  3:13 ` Junio C Hamano
@ 2017-09-28 17:38   ` Jonathan Tan
  0 siblings, 0 replies; 16+ messages in thread
From: Jonathan Tan @ 2017-09-28 17:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git

On Thu, 28 Sep 2017 12:13:00 +0900
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > This is similar to using the hashmap in hashmap.c, but with an
> > easier-to-use API. In particular, custom entry comparisons no longer
> > need to be written, and lookups can be done without constructing a
> > temporary entry structure.
> 
> A naïve question is why this needs to duplicate so much code, just
> to build something similar in spirit to hashmap but unlike hashmap
> that can take caller-defined keys, limited to using oid as the keys,
> instead of just being a thin API wrapper that uses hashmap as its
> internal implementation detail.  
> 
> Is the way hashmap API is structured so hard to use it in such a
> way, or something?

Another reason that I probably should have mentioned is the opportunity
to save 4 bytes. I didn't mention it in the commit message at first
because I felt that the benefit was very specific and, as far as I know,
wouldn't be seen on architectures with 8-byte-aligned pointers until we
change the hash function. But I should have probably written something
like this in the commit message:

    Using oidmap also saves 4 bytes per entry, compared to hashmap,
    because the hash does not need to be stored separately as an int.
    (However, alignment considerations might mean that we will not
    observe these savings until we change the hash function, since
    currently 20-byte hashes are being used.)

If we decide that the 4 bytes are not important, right now at least, I
can change it so that this is a thin API wrapper over hashmap. We could
always change it later.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-28  0:41 ` Brandon Williams
@ 2017-09-28 17:46   ` Jonathan Tan
  2017-09-28 20:05     ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-28 17:46 UTC (permalink / raw)
  To: Brandon Williams; +Cc: git, git

On Wed, 27 Sep 2017 17:41:37 -0700
Brandon Williams <bmwill@google.com> wrote:

> On 09/27, Jonathan Tan wrote:
> > This is similar to using the hashmap in hashmap.c, but with an
> > easier-to-use API. In particular, custom entry comparisons no longer
> > need to be written, and lookups can be done without constructing a
> > temporary entry structure.
> > 
> > oidset has been updated to use oidmap.
> 
> After a quick glance at the code, I think it looks good.  I do have one
> suggestion though.  This map is structured much like our internal
> hashmap where if you want to include any data you have to implement your
> own struct with the internal 'entry' struct as the first member in your
> custom struct.  I personal dislike this method, despite the potential
> memory savings, because it makes the data structure much more difficult
> to use.
> 
> If all i wanted was a map from 'OID -> const char *' then I would need
> to write all the boilerplate code to make a struct which contains the
> map 'entry' struct + the 'const char *' entry.
>
> You are also making the
> caller responsible for allocating the individual entries instead of
> letting the data structure take care of that internally.

In the case where your extra data ("const char *" in your example) fits
in a pointer, yes it's true that the "util" design eliminates the need
to define and allocate a struct. But if you need to store more than
that, you will still have that need.

> To me it seems like a much simpler API for a map would be to just allow
> callers to store a 'void *' as the value.

I agree that the API would be simpler.

My main motivation with this design is indeed to save memory, and not
inconvenience the user too much (in the case where you're storing things
larger than one pointer, you just need to remember to put the special
struct at the beginning of your struct), but if memory is not so
important, I agree that we can switch to the "util" design.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-28 17:46   ` Jonathan Tan
@ 2017-09-28 20:05     ` Jeff King
  2017-09-29 19:04       ` Jonathan Tan
  2017-09-29 21:43       ` Johannes Schindelin
  0 siblings, 2 replies; 16+ messages in thread
From: Jeff King @ 2017-09-28 20:05 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Brandon Williams, git, git

On Thu, Sep 28, 2017 at 10:46:16AM -0700, Jonathan Tan wrote:

> > To me it seems like a much simpler API for a map would be to just allow
> > callers to store a 'void *' as the value.
> 
> I agree that the API would be simpler.
> 
> My main motivation with this design is indeed to save memory, and not
> inconvenience the user too much (in the case where you're storing things
> larger than one pointer, you just need to remember to put the special
> struct at the beginning of your struct), but if memory is not so
> important, I agree that we can switch to the "util" design.

When I saw that you were implementing "oidset" in terms of "oidmap", I
was all ready to be crabby about this extra memory. But then I saw that
the implementation tries hard not to waste any memory. :)

All of which is to say I gave this some thought when I was in the "ready
to be crabby" phase, and came to the conclusion that it probably isn't
that painful. An unused pointer is 8 bytes per entry. We're already
spending 20 for the oid itself (which is likely to grow to 32
eventually), plus 8 for the chained "next" pointer. Plus potentially 8
for a padded version of the hash, if we just use a straight hashmap that
duplicates the hash field.

So depending how you count it, we're wasting between 28% (sha1 and no
extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
it's probably not breaking the bank.

Another way of thinking about it. Large-ish (but not insane) repos have
on the order of 5-10 million objects. If we had an oidset that mentioned
every single object in the repository, that's 40-80MB wasted in the
worst case. For current uses of oidset, that's probably fine. It's
generally used only to collect ref tips (so probably two orders of
magnitude less).

If you're planning on using an oidset to mark every object in a
100-million-object monorepo, we'd probably care more. But I'd venture to
say that any scheme which involves generating that hash table on the fly
is doing it wrong. At at that scale we'd want to look at compact
mmap-able on-disk representations.

So I think we may be better off going with the solution here that's
simpler and requires introducing less code. If it does turn out to be a
memory problem in the future, this is a _really_ easy thing to optimize
after the fact, because we have these nice abstractions.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-28 20:05     ` Jeff King
@ 2017-09-29 19:04       ` Jonathan Tan
  2017-09-29 19:26         ` Jeff King
  2017-09-29 21:43       ` Johannes Schindelin
  1 sibling, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-29 19:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Brandon Williams, git, git

On Thu, 28 Sep 2017 16:05:57 -0400
Jeff King <peff@peff.net> wrote:

> When I saw that you were implementing "oidset" in terms of "oidmap", I
> was all ready to be crabby about this extra memory. But then I saw that
> the implementation tries hard not to waste any memory. :)
> 
> All of which is to say I gave this some thought when I was in the "ready
> to be crabby" phase, and came to the conclusion that it probably isn't
> that painful. An unused pointer is 8 bytes per entry. We're already
> spending 20 for the oid itself (which is likely to grow to 32
> eventually), plus 8 for the chained "next" pointer. Plus potentially 8
> for a padded version of the hash, if we just use a straight hashmap that
> duplicates the hash field.
> 
> So depending how you count it, we're wasting between 28% (sha1 and no
> extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
> it's probably not breaking the bank.

Hmm...how did you get the 16% figure? The values, as I see it, are:
 - 32 for the sha256 hash itself
 - 8 for the "next" chain pointer
 - 8 for the padded hash
 - 8 for the "util" pointer

For an oidset, the padded hash and the "util" pointer are wasted, which is
16/56=0.286. (If you assume, say, 8 bytes of malloc bookkeeping overhead, then
16/64=0.25.)

For other uses of an oidmap, the padded hash and "util" pointer are still
wasted, so the numbers are the same as above (except that the malloc
bookkeeping overhead is doubled).

> Another way of thinking about it. Large-ish (but not insane) repos have
> on the order of 5-10 million objects. If we had an oidset that mentioned
> every single object in the repository, that's 40-80MB wasted in the
> worst case. For current uses of oidset, that's probably fine. It's
> generally used only to collect ref tips (so probably two orders of
> magnitude less).
> 
> If you're planning on using an oidset to mark every object in a
> 100-million-object monorepo, we'd probably care more. But I'd venture to
> say that any scheme which involves generating that hash table on the fly
> is doing it wrong. At at that scale we'd want to look at compact
> mmap-able on-disk representations.

In a 100-million-object monorepo, we will probably end up only operating on the
"frontier" objects (objects that we do not have but we know about because some
object we have reference them) at the worst. I don't have numbers but I think
that that is at least an order of magnitude less than 100M.

> So I think we may be better off going with the solution here that's
> simpler and requires introducing less code. If it does turn out to be a
> memory problem in the future, this is a _really_ easy thing to optimize
> after the fact, because we have these nice abstractions.

Optimizing away the padded hash should be straightforward. Optimizing away the
"util" pointer after we have code that uses it is (slightly?) less
straightforward, but still doable.

I still lean towards saving memory by eliminating the padded hash and
not using the "util" pointer, because the complication is contained
within only two files (oidmap.{c,h}), and because the constant factors
in memory savings might end up mattering. But I don't feel strongly
about that - I think any of the oidmaps that we have discussed is an
improvement over what we have now.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-29 19:04       ` Jonathan Tan
@ 2017-09-29 19:26         ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2017-09-29 19:26 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Brandon Williams, git, git

On Fri, Sep 29, 2017 at 12:04:56PM -0700, Jonathan Tan wrote:

> > So depending how you count it, we're wasting between 28% (sha1 and no
> > extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
> > it's probably not breaking the bank.
> 
> Hmm...how did you get the 16% figure? The values, as I see it, are:
>  - 32 for the sha256 hash itself
>  - 8 for the "next" chain pointer
>  - 8 for the padded hash
>  - 8 for the "util" pointer
> 
> For an oidset, the padded hash and the "util" pointer are wasted, which is
> 16/56=0.286. (If you assume, say, 8 bytes of malloc bookkeeping overhead, then
> 16/64=0.25.)

Sorry to be unclear. I was just counting the waste for the "util"
pointer, not the extra padded hash bit (which AFAIK is already wasted in
oidset). So I computed 48 bytes without the util pointer, which means we
waste an additional 16% to add it.

Anyway, my point was mostly to say that this is a fractional percentage
of the total memory. So it falls into the category of "this might help
in tight situations" and less "this will blow up in our faces".

> In a 100-million-object monorepo, we will probably end up only operating on the
> "frontier" objects (objects that we do not have but we know about because some
> object we have reference them) at the worst. I don't have numbers but I think
> that that is at least an order of magnitude less than 100M.

Agreed.

> > So I think we may be better off going with the solution here that's
> > simpler and requires introducing less code. If it does turn out to be a
> > memory problem in the future, this is a _really_ easy thing to optimize
> > after the fact, because we have these nice abstractions.
> 
> Optimizing away the padded hash should be straightforward. Optimizing away the
> "util" pointer after we have code that uses it is (slightly?) less
> straightforward, but still doable.

I was thinking here of just oidset. It has a limited API, so swapping
out the implementation for one that does not depend on oidmap and waste
the "util" pointer would be the "easy" part.

> I still lean towards saving memory by eliminating the padded hash and
> not using the "util" pointer, because the complication is contained
> within only two files (oidmap.{c,h}), and because the constant factors
> in memory savings might end up mattering. But I don't feel strongly
> about that - I think any of the oidmaps that we have discussed is an
> improvement over what we have now.

My main concern is not so much about complexity bleeding out of the
oidmap code, but that now we have two parallel near-identical hashmap
implementations.  When one gets an optimization, bugfix, or new feature,
we have to port it over manually to the other.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-28 20:05     ` Jeff King
  2017-09-29 19:04       ` Jonathan Tan
@ 2017-09-29 21:43       ` Johannes Schindelin
  2017-09-29 23:24         ` Jeff King
  1 sibling, 1 reply; 16+ messages in thread
From: Johannes Schindelin @ 2017-09-29 21:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, Brandon Williams, git, git

Hi Peff,

On Thu, 28 Sep 2017, Jeff King wrote:

> If you're planning on using an oidset to mark every object in a
> 100-million-object monorepo, we'd probably care more. But I'd venture to
> say that any scheme which involves generating that hash table on the fly
> is doing it wrong. At at that scale we'd want to look at compact
> mmap-able on-disk representations.

Or maybe you would look at a *not-so-compact* mmap()able on-disk
representation, to allow for painless updates.

You really will want to avoid having to write out large files just because
a small part of them changed. We learn that lesson the hard way, from
having to write 350MB worth of .git/index for every single, painful `git
add` operation.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2] oidmap: map with OID as key
  2017-09-27 22:19 [PATCH] oidmap: map with OID as key Jonathan Tan
  2017-09-28  0:41 ` Brandon Williams
  2017-09-28  3:13 ` Junio C Hamano
@ 2017-09-29 22:54 ` Jonathan Tan
  2017-10-02 23:48   ` Brandon Williams
  2 siblings, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-29 22:54 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, bmwill, gitster, peff

This is similar to using the hashmap in hashmap.c, but with an
easier-to-use API. In particular, custom entry comparisons no longer
need to be written, and lookups can be done without constructing a
temporary entry structure.

This is implemented as a thin wrapper over the hashmap API. In
particular, this means that there is an additional 4-byte overhead due
to the fact that the first 4 bytes of the hash is redundantly stored.
For now, I'm taking the simpler approach, but if need be, we can
reimplement oidmap without affecting the callers significantly.

oidset has been updated to use oidmap.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Some replies to v1 [1] [2] seem to indicate that simpler non-duplicated
code should be preferred over optimizing away the storage of the 4-byte
hash code, and I have no objection to that, so I have updated this code
to be a thin wrapper over hashmap with the 4-byte overhead.

After this patch, if the 4-byte overhead is found to be too much, we can
migrate to something similar to v1 relatively easily.

I decided not to go with the util pointer method because we will not be
able to migrate away from it so easily if need be.

[1] https://public-inbox.org/git/xmqqr2ur348z.fsf@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20170929192624.4ukvpjujgiyzgibb@sigill.intra.peff.net/
---
 Makefile     |  1 +
 fetch-pack.c |  2 +-
 oidmap.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++
 oidmap.h     | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 oidset.c     | 36 +++++++-------------------------
 oidset.h     |  6 ++++--
 6 files changed, 133 insertions(+), 31 deletions(-)
 create mode 100644 oidmap.c
 create mode 100644 oidmap.h

diff --git a/Makefile b/Makefile
index ed4ca438b..64136dde4 100644
--- a/Makefile
+++ b/Makefile
@@ -821,6 +821,7 @@ LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
+LIB_OBJS += oidmap.o
 LIB_OBJS += oidset.o
 LIB_OBJS += packfile.o
 LIB_OBJS += pack-bitmap.o
diff --git a/fetch-pack.c b/fetch-pack.c
index 105506e9a..008b25d3d 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -611,7 +611,7 @@ static int tip_oids_contain(struct oidset *tip_oids,
 	 * add to "newlist" between calls, the additions will always be for
 	 * oids that are already in the set.
 	 */
-	if (!tip_oids->map.tablesize) {
+	if (!tip_oids->map.map.tablesize) {
 		add_refs_to_oidset(tip_oids, unmatched);
 		add_refs_to_oidset(tip_oids, newlist);
 	}
diff --git a/oidmap.c b/oidmap.c
new file mode 100644
index 000000000..6db4fffcd
--- /dev/null
+++ b/oidmap.c
@@ -0,0 +1,51 @@
+#include "cache.h"
+#include "oidmap.h"
+
+static int cmpfn(const void *hashmap_cmp_fn_data,
+		 const void *entry, const void *entry_or_key,
+		 const void *keydata)
+{
+	const struct oidmap_entry *entry_ = entry;
+	if (keydata)
+		return oidcmp(&entry_->oid, (const struct object_id *) keydata);
+	return oidcmp(&entry_->oid,
+		      &((const struct oidmap_entry *) entry_or_key)->oid);
+}
+
+static int hash(const struct object_id *oid)
+{
+	int hash;
+	memcpy(&hash, oid->hash, sizeof(hash));
+	return hash;
+}
+
+void oidmap_init(struct oidmap *map, size_t initial_size)
+{
+	hashmap_init(&map->map, cmpfn, NULL, initial_size);
+}
+
+void oidmap_free(struct oidmap *map, int free_entries)
+{
+	if (!map)
+		return;
+	hashmap_free(&map->map, free_entries);
+}
+
+void *oidmap_get(const struct oidmap *map, const struct object_id *key)
+{
+	return hashmap_get_from_hash(&map->map, hash(key), key);
+}
+
+void *oidmap_remove(struct oidmap *map, const struct object_id *key)
+{
+	struct hashmap_entry entry;
+	hashmap_entry_init(&entry, hash(key));
+	return hashmap_remove(&map->map, &entry, key);
+}
+
+void *oidmap_put(struct oidmap *map, void *entry)
+{
+	struct oidmap_entry *to_put = entry;
+	hashmap_entry_init(&to_put->internal_entry, hash(&to_put->oid));
+	return hashmap_put(&map->map, to_put);
+}
diff --git a/oidmap.h b/oidmap.h
new file mode 100644
index 000000000..18f54cde1
--- /dev/null
+++ b/oidmap.h
@@ -0,0 +1,68 @@
+#ifndef OIDMAP_H
+#define OIDMAP_H
+
+#include "hashmap.h"
+
+/*
+ * struct oidmap_entry is a structure representing an entry in the hash table,
+ * which must be used as first member of user data structures.
+ *
+ * Users should set the oid field. oidmap_put() will populate the
+ * internal_entry field.
+ */
+struct oidmap_entry {
+	/* For internal use only */
+	struct hashmap_entry internal_entry;
+
+	struct object_id oid;
+};
+
+struct oidmap {
+	struct hashmap map;
+};
+
+#define OIDMAP_INIT { { NULL } }
+
+/*
+ * Initializes an oidmap structure.
+ *
+ * `map` is the oidmap to initialize.
+ *
+ * If the total number of entries is known in advance, the `initial_size`
+ * parameter may be used to preallocate a sufficiently large table and thus
+ * prevent expensive resizing. If 0, the table is dynamically resized.
+ */
+extern void oidmap_init(struct oidmap *map, size_t initial_size);
+
+/*
+ * Frees an oidmap structure and allocated memory.
+ *
+ * If `free_entries` is true, each oidmap_entry in the map is freed as well
+ * using stdlibs free().
+ */
+extern void oidmap_free(struct oidmap *map, int free_entries);
+
+/*
+ * Returns the oidmap entry for the specified oid, or NULL if not found.
+ */
+extern void *oidmap_get(const struct oidmap *map,
+			const struct object_id *key);
+
+/*
+ * Adds or replaces an oidmap entry.
+ *
+ * ((struct oidmap_entry *) entry)->internal_entry will be populated by this
+ * function.
+ *
+ * Returns the replaced entry, or NULL if not found (i.e. the entry was added).
+ */
+extern void *oidmap_put(struct oidmap *map, void *entry);
+
+/*
+ * Removes an oidmap entry matching the specified oid.
+ *
+ * Returns the removed entry, or NULL if not found.
+ */
+extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
+
+#endif
diff --git a/oidset.c b/oidset.c
index a6a08ba52..f1f874aaa 100644
--- a/oidset.c
+++ b/oidset.c
@@ -1,50 +1,30 @@
 #include "cache.h"
 #include "oidset.h"
 
-struct oidset_entry {
-	struct hashmap_entry hash;
-	struct object_id oid;
-};
-
-static int oidset_hashcmp(const void *unused_cmp_data,
-			  const void *va, const void *vb,
-			  const void *vkey)
-{
-	const struct oidset_entry *a = va, *b = vb;
-	const struct object_id *key = vkey;
-	return oidcmp(&a->oid, key ? key : &b->oid);
-}
-
 int oidset_contains(const struct oidset *set, const struct object_id *oid)
 {
-	struct hashmap_entry key;
-
-	if (!set->map.cmpfn)
+	if (!set->map.map.tablesize)
 		return 0;
-
-	hashmap_entry_init(&key, sha1hash(oid->hash));
-	return !!hashmap_get(&set->map, &key, oid);
+	return !!oidmap_get(&set->map, oid);
 }
 
 int oidset_insert(struct oidset *set, const struct object_id *oid)
 {
-	struct oidset_entry *entry;
-
-	if (!set->map.cmpfn)
-		hashmap_init(&set->map, oidset_hashcmp, NULL, 0);
+	struct oidmap_entry *entry;
 
-	if (oidset_contains(set, oid))
+	if (!set->map.map.tablesize)
+		oidmap_init(&set->map, 0);
+	else if (oidset_contains(set, oid))
 		return 1;
 
 	entry = xmalloc(sizeof(*entry));
-	hashmap_entry_init(&entry->hash, sha1hash(oid->hash));
 	oidcpy(&entry->oid, oid);
 
-	hashmap_add(&set->map, entry);
+	oidmap_put(&set->map, entry);
 	return 0;
 }
 
 void oidset_clear(struct oidset *set)
 {
-	hashmap_free(&set->map, 1);
+	oidmap_free(&set->map, 1);
 }
diff --git a/oidset.h b/oidset.h
index b7eaab5b8..f4c9e0f9c 100644
--- a/oidset.h
+++ b/oidset.h
@@ -1,6 +1,8 @@
 #ifndef OIDSET_H
 #define OIDSET_H
 
+#include "oidmap.h"
+
 /**
  * This API is similar to sha1-array, in that it maintains a set of object ids
  * in a memory-efficient way. The major differences are:
@@ -17,10 +19,10 @@
  * A single oidset; should be zero-initialized (or use OIDSET_INIT).
  */
 struct oidset {
-	struct hashmap map;
+	struct oidmap map;
 };
 
-#define OIDSET_INIT { { NULL } }
+#define OIDSET_INIT { OIDMAP_INIT }
 
 /**
  * Returns true iff `set` contains `oid`.
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] oidmap: map with OID as key
  2017-09-29 21:43       ` Johannes Schindelin
@ 2017-09-29 23:24         ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2017-09-29 23:24 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jonathan Tan, Brandon Williams, git, git

On Fri, Sep 29, 2017 at 11:43:57PM +0200, Johannes Schindelin wrote:

> On Thu, 28 Sep 2017, Jeff King wrote:
> 
> > If you're planning on using an oidset to mark every object in a
> > 100-million-object monorepo, we'd probably care more. But I'd venture to
> > say that any scheme which involves generating that hash table on the fly
> > is doing it wrong. At at that scale we'd want to look at compact
> > mmap-able on-disk representations.
> 
> Or maybe you would look at a *not-so-compact* mmap()able on-disk
> representation, to allow for painless updates.
> 
> You really will want to avoid having to write out large files just because
> a small part of them changed. We learn that lesson the hard way, from
> having to write 350MB worth of .git/index for every single, painful `git
> add` operation.

Sure. I didn't mean to start designing the format. I just mean that if
the first step of the process is "read information about all 100 million
objects into an in-RAM hashmap", then that is definitely not going to
fly.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] oidmap: map with OID as key
  2017-09-29 22:54 ` [PATCH v2] " Jonathan Tan
@ 2017-10-02 23:48   ` Brandon Williams
  2017-10-03  6:31     ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Brandon Williams @ 2017-10-02 23:48 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff

On 09/29, Jonathan Tan wrote:
> This is similar to using the hashmap in hashmap.c, but with an
> easier-to-use API. In particular, custom entry comparisons no longer
> need to be written, and lookups can be done without constructing a
> temporary entry structure.
> 
> This is implemented as a thin wrapper over the hashmap API. In
> particular, this means that there is an additional 4-byte overhead due
> to the fact that the first 4 bytes of the hash is redundantly stored.
> For now, I'm taking the simpler approach, but if need be, we can
> reimplement oidmap without affecting the callers significantly.
> 
> oidset has been updated to use oidmap.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Some replies to v1 [1] [2] seem to indicate that simpler non-duplicated
> code should be preferred over optimizing away the storage of the 4-byte
> hash code, and I have no objection to that, so I have updated this code
> to be a thin wrapper over hashmap with the 4-byte overhead.
> 
> After this patch, if the 4-byte overhead is found to be too much, we can
> migrate to something similar to v1 relatively easily.
> 
> I decided not to go with the util pointer method because we will not be
> able to migrate away from it so easily if need be.

This makes me a bit sad because I tend to lean more towards making
things simpler.  I'm still a supporter of the 'util' pointer but I
understand why we would choose otherwise.

> 
> [1] https://public-inbox.org/git/xmqqr2ur348z.fsf@gitster.mtv.corp.google.com/
> [2] https://public-inbox.org/git/20170929192624.4ukvpjujgiyzgibb@sigill.intra.peff.net/
> ---
>  Makefile     |  1 +
>  fetch-pack.c |  2 +-
>  oidmap.c     | 51 +++++++++++++++++++++++++++++++++++++++++++++
>  oidmap.h     | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  oidset.c     | 36 +++++++-------------------------
>  oidset.h     |  6 ++++--
>  6 files changed, 133 insertions(+), 31 deletions(-)
>  create mode 100644 oidmap.c
>  create mode 100644 oidmap.h
> 
> diff --git a/Makefile b/Makefile
> index ed4ca438b..64136dde4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -821,6 +821,7 @@ LIB_OBJS += notes-cache.o
>  LIB_OBJS += notes-merge.o
>  LIB_OBJS += notes-utils.o
>  LIB_OBJS += object.o
> +LIB_OBJS += oidmap.o
>  LIB_OBJS += oidset.o
>  LIB_OBJS += packfile.o
>  LIB_OBJS += pack-bitmap.o
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 105506e9a..008b25d3d 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -611,7 +611,7 @@ static int tip_oids_contain(struct oidset *tip_oids,
>  	 * add to "newlist" between calls, the additions will always be for
>  	 * oids that are already in the set.
>  	 */
> -	if (!tip_oids->map.tablesize) {
> +	if (!tip_oids->map.map.tablesize) {
>  		add_refs_to_oidset(tip_oids, unmatched);
>  		add_refs_to_oidset(tip_oids, newlist);
>  	}
> diff --git a/oidmap.c b/oidmap.c
> new file mode 100644
> index 000000000..6db4fffcd
> --- /dev/null
> +++ b/oidmap.c
> @@ -0,0 +1,51 @@
> +#include "cache.h"
> +#include "oidmap.h"
> +
> +static int cmpfn(const void *hashmap_cmp_fn_data,
> +		 const void *entry, const void *entry_or_key,
> +		 const void *keydata)
> +{
> +	const struct oidmap_entry *entry_ = entry;
> +	if (keydata)
> +		return oidcmp(&entry_->oid, (const struct object_id *) keydata);
> +	return oidcmp(&entry_->oid,
> +		      &((const struct oidmap_entry *) entry_or_key)->oid);
> +}
> +
> +static int hash(const struct object_id *oid)
> +{
> +	int hash;
> +	memcpy(&hash, oid->hash, sizeof(hash));
> +	return hash;
> +}
> +
> +void oidmap_init(struct oidmap *map, size_t initial_size)
> +{
> +	hashmap_init(&map->map, cmpfn, NULL, initial_size);
> +}
> +
> +void oidmap_free(struct oidmap *map, int free_entries)
> +{
> +	if (!map)
> +		return;
> +	hashmap_free(&map->map, free_entries);
> +}
> +
> +void *oidmap_get(const struct oidmap *map, const struct object_id *key)
> +{
> +	return hashmap_get_from_hash(&map->map, hash(key), key);
> +}
> +
> +void *oidmap_remove(struct oidmap *map, const struct object_id *key)
> +{
> +	struct hashmap_entry entry;
> +	hashmap_entry_init(&entry, hash(key));
> +	return hashmap_remove(&map->map, &entry, key);
> +}
> +
> +void *oidmap_put(struct oidmap *map, void *entry)
> +{
> +	struct oidmap_entry *to_put = entry;
> +	hashmap_entry_init(&to_put->internal_entry, hash(&to_put->oid));
> +	return hashmap_put(&map->map, to_put);
> +}
> diff --git a/oidmap.h b/oidmap.h
> new file mode 100644
> index 000000000..18f54cde1
> --- /dev/null
> +++ b/oidmap.h
> @@ -0,0 +1,68 @@
> +#ifndef OIDMAP_H
> +#define OIDMAP_H
> +
> +#include "hashmap.h"
> +
> +/*
> + * struct oidmap_entry is a structure representing an entry in the hash table,
> + * which must be used as first member of user data structures.
> + *
> + * Users should set the oid field. oidmap_put() will populate the
> + * internal_entry field.
> + */
> +struct oidmap_entry {
> +	/* For internal use only */
> +	struct hashmap_entry internal_entry;
> +
> +	struct object_id oid;
> +};
> +
> +struct oidmap {
> +	struct hashmap map;
> +};
> +
> +#define OIDMAP_INIT { { NULL } }
> +
> +/*
> + * Initializes an oidmap structure.
> + *
> + * `map` is the oidmap to initialize.
> + *
> + * If the total number of entries is known in advance, the `initial_size`
> + * parameter may be used to preallocate a sufficiently large table and thus
> + * prevent expensive resizing. If 0, the table is dynamically resized.
> + */
> +extern void oidmap_init(struct oidmap *map, size_t initial_size);
> +
> +/*
> + * Frees an oidmap structure and allocated memory.
> + *
> + * If `free_entries` is true, each oidmap_entry in the map is freed as well
> + * using stdlibs free().
> + */
> +extern void oidmap_free(struct oidmap *map, int free_entries);
> +
> +/*
> + * Returns the oidmap entry for the specified oid, or NULL if not found.
> + */
> +extern void *oidmap_get(const struct oidmap *map,
> +			const struct object_id *key);
> +
> +/*
> + * Adds or replaces an oidmap entry.
> + *
> + * ((struct oidmap_entry *) entry)->internal_entry will be populated by this
> + * function.
> + *
> + * Returns the replaced entry, or NULL if not found (i.e. the entry was added).
> + */
> +extern void *oidmap_put(struct oidmap *map, void *entry);
> +
> +/*
> + * Removes an oidmap entry matching the specified oid.
> + *
> + * Returns the removed entry, or NULL if not found.
> + */
> +extern void *oidmap_remove(struct oidmap *map, const struct object_id *key);
> +
> +#endif
> diff --git a/oidset.c b/oidset.c
> index a6a08ba52..f1f874aaa 100644
> --- a/oidset.c
> +++ b/oidset.c
> @@ -1,50 +1,30 @@
>  #include "cache.h"
>  #include "oidset.h"
>  
> -struct oidset_entry {
> -	struct hashmap_entry hash;
> -	struct object_id oid;
> -};
> -
> -static int oidset_hashcmp(const void *unused_cmp_data,
> -			  const void *va, const void *vb,
> -			  const void *vkey)
> -{
> -	const struct oidset_entry *a = va, *b = vb;
> -	const struct object_id *key = vkey;
> -	return oidcmp(&a->oid, key ? key : &b->oid);
> -}
> -
>  int oidset_contains(const struct oidset *set, const struct object_id *oid)
>  {
> -	struct hashmap_entry key;
> -
> -	if (!set->map.cmpfn)
> +	if (!set->map.map.tablesize)
>  		return 0;
> -
> -	hashmap_entry_init(&key, sha1hash(oid->hash));
> -	return !!hashmap_get(&set->map, &key, oid);
> +	return !!oidmap_get(&set->map, oid);
>  }
>  
>  int oidset_insert(struct oidset *set, const struct object_id *oid)
>  {
> -	struct oidset_entry *entry;
> -
> -	if (!set->map.cmpfn)
> -		hashmap_init(&set->map, oidset_hashcmp, NULL, 0);
> +	struct oidmap_entry *entry;
>  
> -	if (oidset_contains(set, oid))
> +	if (!set->map.map.tablesize)
> +		oidmap_init(&set->map, 0);
> +	else if (oidset_contains(set, oid))
>  		return 1;
>  
>  	entry = xmalloc(sizeof(*entry));
> -	hashmap_entry_init(&entry->hash, sha1hash(oid->hash));
>  	oidcpy(&entry->oid, oid);
>  
> -	hashmap_add(&set->map, entry);
> +	oidmap_put(&set->map, entry);
>  	return 0;
>  }
>  
>  void oidset_clear(struct oidset *set)
>  {
> -	hashmap_free(&set->map, 1);
> +	oidmap_free(&set->map, 1);
>  }
> diff --git a/oidset.h b/oidset.h
> index b7eaab5b8..f4c9e0f9c 100644
> --- a/oidset.h
> +++ b/oidset.h
> @@ -1,6 +1,8 @@
>  #ifndef OIDSET_H
>  #define OIDSET_H
>  
> +#include "oidmap.h"
> +
>  /**
>   * This API is similar to sha1-array, in that it maintains a set of object ids
>   * in a memory-efficient way. The major differences are:
> @@ -17,10 +19,10 @@
>   * A single oidset; should be zero-initialized (or use OIDSET_INIT).
>   */
>  struct oidset {
> -	struct hashmap map;
> +	struct oidmap map;
>  };
>  
> -#define OIDSET_INIT { { NULL } }
> +#define OIDSET_INIT { OIDMAP_INIT }
>  
>  /**
>   * Returns true iff `set` contains `oid`.
> -- 
> 2.14.2.822.g60be5d43e6-goog
> 

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] oidmap: map with OID as key
  2017-10-02 23:48   ` Brandon Williams
@ 2017-10-03  6:31     ` Jeff King
  2017-10-04  0:29       ` Jonathan Tan
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2017-10-03  6:31 UTC (permalink / raw)
  To: Brandon Williams; +Cc: Jonathan Tan, git, gitster

On Mon, Oct 02, 2017 at 04:48:48PM -0700, Brandon Williams wrote:

> > Some replies to v1 [1] [2] seem to indicate that simpler non-duplicated
> > code should be preferred over optimizing away the storage of the 4-byte
> > hash code, and I have no objection to that, so I have updated this code
> > to be a thin wrapper over hashmap with the 4-byte overhead.
> > 
> > After this patch, if the 4-byte overhead is found to be too much, we can
> > migrate to something similar to v1 relatively easily.
> > 
> > I decided not to go with the util pointer method because we will not be
> > able to migrate away from it so easily if need be.
> 
> This makes me a bit sad because I tend to lean more towards making
> things simpler.  I'm still a supporter of the 'util' pointer but I
> understand why we would choose otherwise.

Right, I kind of wonder if this has fallen into an uncanny value where
we have this almost-hashmap infrastructure, but the end result is not
significantly easier to use than a plain-old hashmap.

I.e., it looks like you still have to declare something like:

  struct my_data {
        struct oidmap_entry oid;
	int value; /* mapping to an int */
  };

and handle the allocation of the entry yourself. If we instead just
adding an oidhash() and oidcmpfn(), then callers could those directly.

The invocations are a _little_ longer with a raw hashmap, but not much
(as you can see from the actual oidmap implementation, and the changes
to oidset).

I dunno. I'm not against it per se. The API _is_ a little nicer, but I
just wonder if there's a downside to even the thin wrapper, in that
callers are no longer free to use other parts of the hashmap API. If I
saw some converted callers I might be more convinced.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] oidmap: map with OID as key
  2017-10-03  6:31     ` Jeff King
@ 2017-10-04  0:29       ` Jonathan Tan
  2017-10-04  7:45         ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-10-04  0:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Brandon Williams, Git mailing list, Junio C Hamano

On Mon, Oct 2, 2017 at 11:31 PM, Jeff King <peff@peff.net> wrote:
> Right, I kind of wonder if this has fallen into an uncanny value where
> we have this almost-hashmap infrastructure, but the end result is not
> significantly easier to use than a plain-old hashmap.
>
> I.e., it looks like you still have to declare something like:
>
>   struct my_data {
>         struct oidmap_entry oid;
>         int value; /* mapping to an int */
>   };
>
> and handle the allocation of the entry yourself. If we instead just
> adding an oidhash() and oidcmpfn(), then callers could those directly.

I thought of something like that, but it seems that you have to
remember quite a few things:
- your entry must have "struct oidmap_entry" at the start, not "struct
hashmap_entry"
- initialize your hashmap with oidcmpfn()
- when getting, hashmap_get_from_hash(map, oidhash(&oid), &oid) (and
oid might be longer e.g. ref->old_oid)

> The invocations are a _little_ longer with a raw hashmap, but not much
> (as you can see from the actual oidmap implementation, and the changes
> to oidset).

About the invocation of hashmap_get_from_hash(), I felt that it would
get annoying quickly enough that I would want an oidmap_get(const
struct hashmap *, const struct object_id *) but it might be strange
that the "get" method is named differently from the rest. If we
tolerate oidmap_get(), and tolerate the fact that the user must both
declare "struct oidmap_entry" instead of "struct hashmap_entry" and
initialize the hashmap with oidcmpfn() (so that the invocation to
hashmap_get_from_hash() within oidmap_get() sends the correct
keydata), we can avoid the thin wrapper issue where callers can no
longer use other methods of hashmap. At this point I decided that I
prefer the thin wrapper, but the "light touch" (struct oidmap_entry,
oidcmpfn(), oidmap_get() only) still better than the status quo.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] oidmap: map with OID as key
  2017-10-04  0:29       ` Jonathan Tan
@ 2017-10-04  7:45         ` Jeff King
  2017-10-04  8:48           ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2017-10-04  7:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Brandon Williams, Git mailing list, Junio C Hamano

On Tue, Oct 03, 2017 at 05:29:01PM -0700, Jonathan Tan wrote:

> At this point I decided that I prefer the thin wrapper, but the "light
> touch" (struct oidmap_entry, oidcmpfn(), oidmap_get() only) still
> better than the status quo.

OK. I can certainly live with that. And worst case, I suppose, is that a
caller wants some underlying hashmap function and we just have to extend
the oidmap API to include it. It's not like we're adding new hashmap
functions willy-nilly.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2] oidmap: map with OID as key
  2017-10-04  7:45         ` Jeff King
@ 2017-10-04  8:48           ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2017-10-04  8:48 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, Brandon Williams, Git mailing list

Jeff King <peff@peff.net> writes:

> On Tue, Oct 03, 2017 at 05:29:01PM -0700, Jonathan Tan wrote:
>
>> At this point I decided that I prefer the thin wrapper, but the "light
>> touch" (struct oidmap_entry, oidcmpfn(), oidmap_get() only) still
>> better than the status quo.
>
> OK. I can certainly live with that. And worst case, I suppose, is that a
> caller wants some underlying hashmap function and we just have to extend
> the oidmap API to include it. It's not like we're adding new hashmap
> functions willy-nilly.

OK, I think I can live with that, too.  I'll tentatively mark the
topic to be merged to 'next' but give it for a few days so that
others can stop me.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-10-04  8:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-27 22:19 [PATCH] oidmap: map with OID as key Jonathan Tan
2017-09-28  0:41 ` Brandon Williams
2017-09-28 17:46   ` Jonathan Tan
2017-09-28 20:05     ` Jeff King
2017-09-29 19:04       ` Jonathan Tan
2017-09-29 19:26         ` Jeff King
2017-09-29 21:43       ` Johannes Schindelin
2017-09-29 23:24         ` Jeff King
2017-09-28  3:13 ` Junio C Hamano
2017-09-28 17:38   ` Jonathan Tan
2017-09-29 22:54 ` [PATCH v2] " Jonathan Tan
2017-10-02 23:48   ` Brandon Williams
2017-10-03  6:31     ` Jeff King
2017-10-04  0:29       ` Jonathan Tan
2017-10-04  7:45         ` Jeff King
2017-10-04  8:48           ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).