git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Jakub Narebski" <jnareb@gmail.com>, "Ted Ts'o" <tytso@mit.edu>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Clemens Buchacher" <drizzd@aon.at>
Subject: [PATCH 2/5] add object-cache infrastructure
Date: Mon, 11 Jul 2011 12:17:54 -0400	[thread overview]
Message-ID: <20110711161754.GB10418@sigill.intra.peff.net> (raw)
In-Reply-To: <20110711161332.GA10057@sigill.intra.peff.net>

There is sometimes a need to cache some information about an
object or set of objects persistently across git
invocations. The notes-cache interface can be used for this,
but it is very heavyweight and slow for storing small
values.

This patch introduces a new API, object-cache, which stores
a mapping of objects to values in a concise and efficient
form. See the added API documentation for details.

Signed-off-by: Jeff King <peff@peff.net>
---
As I mentioned earlier, I wanted this to be generic and size-agnostic,
because I'd also like to try caching patch-ids for git-cherry.

 Documentation/technical/api-decorate.txt     |    3 +
 Documentation/technical/api-object-cache.txt |  101 +++++++++++++
 Makefile                                     |    2 +
 object-cache.c                               |  204 ++++++++++++++++++++++++++
 object-cache.h                               |   32 ++++
 5 files changed, 342 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/technical/api-object-cache.txt
 create mode 100644 object-cache.c
 create mode 100644 object-cache.h

diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt
index 8d10b66..c851f59 100644
--- a/Documentation/technical/api-decorate.txt
+++ b/Documentation/technical/api-decorate.txt
@@ -13,6 +13,9 @@ could, for example, map objects into 32-bit integers. For ease of use,
 functions are provided for storing the values of arbitrary pointers,
 which can point to strings or structs.
 
+Note that the decorate API only stores the mapping in memory. See the
+object-cache API for persistent storage.
+
 Data Structures
 ---------------
 
diff --git a/Documentation/technical/api-object-cache.txt b/Documentation/technical/api-object-cache.txt
new file mode 100644
index 0000000..ecead4c
--- /dev/null
+++ b/Documentation/technical/api-object-cache.txt
@@ -0,0 +1,101 @@
+Object Cache API
+================
+
+The object cache API is meant to store a cache of per-object values. The
+aim is for robustness, speed, and simplicity. The stored data must be of
+a fixed size, and the only operations allowed are insertion and
+retrieval with a struct object as the key.
+
+This API is similar to the decorate API, but provides persistence of
+values across multiple invocations. It is also similar to the
+notes-cache API, but is much lighter weight and suitable for storing
+small values. If you are storing large, arbitrary data, consider using
+notes-cache.
+
+
+Storage
+-------
+
+Values are stored both on-disk and in-memory. Newly added values are
+initially stored in a hash table in memory, and written to disk only
+when `object_cache_write` is called.
+
+The disk storage consists of a single file per cache, located in the
+`$GIT_DIR/cache` directory. Each file contains a list of rows ordered by
+sha1. Each row contains a binary sha1, followed by the fixed-size data
+mapped to that sha1.
+
+When the cache is written to disk, the contents of the in-memory data
+and the disk data are merged, with in-memory values taking precedence
+over disk values. The data is written to a temporary file and atomically
+renamed into the new cache file. Thus there is no lock contention
+between competing processes on either reading or writing (though one
+process's updates may be lost).
+
+
+Speed
+-----
+
+Lookup in the cache requires `O(lg(n))` hash comparisons (via binary
+search of the disk contents, or the in-memory hash table).
+
+Insertion into the cache is amortized `O(1)` via the hash table. Writing
+the cache to disk entails `O(n*lg(n) + m)` hash comparisons, where `m`
+is the number of existing disk entries and `n` is the number of newly
+added entries.
+
+
+Data Structures
+---------------
+
+`struct object_cache`::
+
+	This structure represents a single object cache. Its contents
+	should be considered opaque by calling code.
++
+	The `fd`, `disk_entries`, `disk_nr`, and `record_size` fields
+	represent the mmap'd on-disk data. The `mem` field represents
+	the in-memory data (see the decorate API for details).
+
+
+Functions
+---------
+
+`object_cache_init`::
+
+	Initialize a new object cache; this must be called before any
+	other functions are called on the cache. The `name` parameter
+	specifies a human-readable name which will be used for storage
+	in `$GIT_DIR/cache/$name`. The `width` parameter specifies the
+	size, in units of `char`, of the data to be stored (e.g., pass
+	in `sizeof(uint32_t)` to store 32-bit integers).
+
+`object_cache_lookup`::
+
+	Retrieve a value from the object cache. A void pointer to the
+	stored value will be returned, or `NULL` if there is no value.
+
+`object_cache_add`::
+
+	Store a value in the object cache. The value pointer should
+	point to exactly `width` bytes of data.
+
+`object_cache_write`::
+
+	Merge the in-memory and existing disk entries, and write the
+	resulting cache to disk, under the name `$GIT_DIR/cache/$name`.
+	If no in-memory entries exist, the on-disk cache will be left
+	unchanged.
+
+`object_cache_lookup_uint32`::
+
+	Convenience wrapper for retrieving unsigned 32-bit integers. The
+	value will be returned via the `value` pointer. The return value
+	is `0` if a value was found, or negative otherwise (in which
+	case the contents of `value` will be unchanged).
+
+`object_cache_add_uint32`::
+
+	Convenience wrapper for storing unsigned 32-bit integers. Note
+	that integers are stored on disk in network-byte order, so it is
+	safe to access caches from any architecture.
diff --git a/Makefile b/Makefile
index f8c72e1..dd05579 100644
--- a/Makefile
+++ b/Makefile
@@ -535,6 +535,7 @@ LIB_H += merge-recursive.h
 LIB_H += notes.h
 LIB_H += notes-cache.h
 LIB_H += notes-merge.h
+LIB_H += object-cache.h
 LIB_H += object.h
 LIB_H += pack.h
 LIB_H += pack-refs.h
@@ -628,6 +629,7 @@ LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
+LIB_OBJS += object-cache.o
 LIB_OBJS += object.o
 LIB_OBJS += pack-check.o
 LIB_OBJS += pack-refs.o
diff --git a/object-cache.c b/object-cache.c
new file mode 100644
index 0000000..d019f0a
--- /dev/null
+++ b/object-cache.c
@@ -0,0 +1,204 @@
+#include "cache.h"
+#include "object-cache.h"
+#include "sha1-lookup.h"
+#include "object.h"
+
+static const char *object_cache_path(const char *name)
+{
+	return git_path("cache/%s", name);
+}
+
+static void open_disk_cache(struct object_cache *c, const char *path)
+{
+	struct stat sb;
+
+	c->fd = open(path, O_RDONLY);
+	if (c->fd < 0)
+		return;
+
+	if (fstat(c->fd, &sb) < 0) {
+		close(c->fd);
+		c->fd = -1;
+		return;
+	}
+
+	c->disk_entries = xmmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE,
+				c->fd, 0);
+	c->disk_nr = sb.st_size / c->record_size;
+}
+
+void object_cache_init(struct object_cache *c, const char *name, int width)
+{
+	memset(c, 0, sizeof(*c));
+	c->record_size = 20 + width;
+	c->mem.width = width;
+	open_disk_cache(c, object_cache_path(name));
+}
+
+static void *lookup_disk(const struct object_cache *c,
+			 const struct object *obj)
+{
+	int pos;
+
+	pos = sha1_entry_pos(c->disk_entries, c->record_size, 0,
+			     0, c->disk_nr, c->disk_nr, obj->sha1);
+	if (pos < 0)
+		return NULL;
+
+	return c->disk_entries + (pos * c->record_size) + 20;
+}
+
+const void *object_cache_lookup(const struct object_cache *c,
+					 const struct object *obj)
+{
+	void *r;
+
+	r = lookup_decoration_value(&c->mem, obj);
+	if (!r)
+		r = lookup_disk(c, obj);
+	return r;
+}
+
+void object_cache_add(struct object_cache *c, const struct object *obj,
+		      const void *value)
+{
+	add_decoration_value(&c->mem, obj, value, NULL);
+}
+
+static unsigned char *flatten_mem_entries(struct object_cache *c)
+{
+	unsigned char *p;
+	unsigned char *ret;
+	int nr;
+
+	ret = xmalloc(c->mem.nr * c->record_size);
+	nr = 0;
+	for (p = c->mem.hash; p < c->mem.end; p += c->mem.stride) {
+		struct object_decoration *e = (struct object_decoration *)p;
+		unsigned char *out;
+
+		if (!e->base)
+			continue;
+
+		if (nr == c->mem.nr)
+			die("BUG: decorate hash contained extra values");
+
+		out = ret + (nr * c->record_size);
+		hashcpy(out, e->base->sha1);
+		out += 20;
+		memcpy(out, e->decoration, c->mem.width);
+		nr++;
+	}
+
+	return ret;
+}
+
+static int void_hashcmp(const void *a, const void *b)
+{
+	return hashcmp(a, b);
+}
+
+int object_cache_write(struct object_cache *c, const char *name)
+{
+	const char *path = object_cache_path(name);
+	struct strbuf tempfile = STRBUF_INIT;
+	int fd;
+	unsigned char *mem_entries;
+	int i, j;
+	int r;
+
+	if (!c->mem.nr)
+		return 0;
+
+	strbuf_addf(&tempfile, "%s.XXXXXX", path);
+	if (safe_create_leading_directories(tempfile.buf) < 0 ||
+	    (fd = git_mkstemp_mode(tempfile.buf, 0755)) < 0) {
+		strbuf_release(&tempfile);
+		return -1;
+	}
+
+	mem_entries = flatten_mem_entries(c);
+	qsort(mem_entries, c->mem.nr, c->record_size, void_hashcmp);
+
+#define WRITE_ENTRY(e) \
+do { \
+	if (write_in_full(fd, e, c->record_size) < 0) { \
+		close(fd); \
+		unlink(tempfile.buf); \
+		strbuf_release(&tempfile); \
+		free(mem_entries); \
+		return -1; \
+	} \
+} while (0)
+
+	i = j = 0;
+	while (i < c->mem.nr && j < c->disk_nr) {
+		unsigned char *mem = mem_entries + (i * c->record_size);
+		unsigned char *disk = c->disk_entries + (j * c->record_size);
+		int cmp = hashcmp(mem, disk);
+
+		/* replace disk entries with in-memory ones */
+		if (cmp == 0)
+			j++;
+		else if (cmp < 0) {
+			WRITE_ENTRY(mem);
+			i++;
+		}
+		else {
+			WRITE_ENTRY(disk);
+			j++;
+		}
+	}
+	while (i < c->mem.nr) {
+		unsigned char *mem = mem_entries + (i * c->record_size);
+		WRITE_ENTRY(mem);
+		i++;
+	}
+	while (j < c->disk_nr) {
+		unsigned char *disk = c->disk_entries + (j * c->record_size);
+		WRITE_ENTRY(disk);
+		j++;
+	}
+#undef WRITE_ENTRY
+
+	free(mem_entries);
+
+	if (close(fd) < 0 || rename(tempfile.buf, path) < 0)
+		r = -1;
+	else
+		r = 0;
+
+	unlink(tempfile.buf);
+	strbuf_release(&tempfile);
+	return r;
+}
+
+int object_cache_lookup_uint32(const struct object_cache *c,
+			       const struct object *obj,
+			       uint32_t *value)
+{
+	const uint32_t *out;
+
+	if (c->record_size != 24)
+		die("BUG: size mismatch in object cache lookup (%d != 24)",
+		    c->record_size);
+
+	out = object_cache_lookup(c, obj);
+	if (!out)
+		return -1;
+
+	*value = ntohl(*out);
+	return 0;
+}
+
+void object_cache_add_uint32(struct object_cache *c,
+			     const struct object *obj,
+			     uint32_t value)
+{
+	if (c->record_size != 24)
+		die("BUG: size mismatch in object cache add (%d != 24)",
+		    c->record_size);
+
+	value = htonl(value);
+	object_cache_add(c, obj, &value);
+}
diff --git a/object-cache.h b/object-cache.h
new file mode 100644
index 0000000..b869f1e
--- /dev/null
+++ b/object-cache.h
@@ -0,0 +1,32 @@
+#ifndef OBJECT_CACHE_H
+#define OBJECT_CACHE_H
+
+#include "decorate.h"
+
+struct object_cache {
+	/* mmap'd disk entries */
+	int fd;
+	unsigned char *disk_entries;
+	int disk_nr;
+	int record_size;
+
+	/* in memory entries */
+	struct decoration mem;
+};
+
+void object_cache_init(struct object_cache *, const char *name, int width);
+const void *object_cache_lookup(const struct object_cache *,
+				const struct object *);
+void object_cache_add(struct object_cache *, const struct object *,
+		      const void *value);
+int object_cache_write(struct object_cache *, const char *name);
+
+/* Convenience wrappers around object_cache_{lookup,add} */
+int object_cache_lookup_uint32(const struct object_cache *,
+			       const struct object *,
+			       uint32_t *value);
+void object_cache_add_uint32(struct object_cache *,
+			     const struct object *,
+			     uint32_t value);
+
+#endif /* OBJECT_CACHE_H */
-- 
1.7.6.37.g989c6

  parent reply	other threads:[~2011-07-11 16:18 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
2011-07-11 16:39   ` Jeff King
2011-07-11 19:06   ` Junio C Hamano
2011-07-11 21:20     ` Jeff King
2011-07-11 16:17 ` Jeff King [this message]
2011-07-11 16:46   ` [PATCH 2/5] add object-cache infrastructure Jeff King
2011-07-11 16:58     ` Shawn Pearce
2011-07-11 19:17   ` Junio C Hamano
2011-07-11 22:01     ` Jeff King
2011-07-11 23:21       ` Junio C Hamano
2011-07-11 23:42         ` Junio C Hamano
2011-07-12  0:03         ` Jeff King
2011-07-12 19:38           ` Clemens Buchacher
2011-07-12 19:45             ` Jeff King
2011-07-12 21:07               ` Clemens Buchacher
2011-07-12 21:15                 ` Clemens Buchacher
2011-07-12 21:36                 ` Jeff King
2011-07-14  8:04                   ` Clemens Buchacher
2011-07-14 16:26                     ` Illia Bobyr
2011-07-13  1:33                 ` John Szakmeister
2011-07-12  0:14         ` Illia Bobyr
2011-07-12  5:35           ` Jeff King
2011-07-12 21:52             ` Illia Bobyr
2011-07-12  6:36       ` Miles Bader
2011-07-12 10:41   ` Jakub Narebski
2011-07-12 17:57     ` Jeff King
2011-07-12 18:41       ` Junio C Hamano
2011-07-13  6:37         ` Jeff King
2011-07-13 17:49           ` Junio C Hamano
2011-07-11 16:18 ` [PATCH 3/5] commit: add commit_generation function Jeff King
2011-07-11 17:57   ` Clemens Buchacher
2011-07-11 21:10     ` Jeff King
2011-07-11 16:18 ` [PATCH 4/5] pretty: support %G to show the generation number of a commit Jeff King
2011-07-11 16:18 ` [PATCH 5/5] limit "contains" traversals based on commit generation Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110711161754.GB10418@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=drizzd@aon.at \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).