[RFC/PATCH 0/5] generation numbers for faster traversals

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC/PATCH 0/5] generation numbers for faster traversals
@ 2011-07-11 16:13 Jeff King
  2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:13 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

So here's my series for using cached generations with "tag --contains".
I already posted some numbers here:

  http://article.gmane.org/gmane.comp.version-control.git/176807

But to recap, this seems to be about as fast as limiting by timestamp,
but without the skew issues. It uses 24 bytes per commit of disk
storage for the cache (about 6 megabytes for linux-2.6).

If we like this approach, we could do something similar for "branch
--contains", and possibly move "name-rev" to use generations instead of
timestamps with a timestamp slop.

These patches are built on top of ffc4b80 (tag: speed up --contains
calculation, 2011-06-11), the first commit from jk/tag-contains-ab.

  [1/5]: decorate: allow storing values instead of pointers
  [2/5]: add object-cache infrastructure
  [3/5]: commit: add commit_generation function
  [4/5]: pretty: support %G to show the generation number of a commit
  [5/5]: limit "contains" traversals based on commit generation

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/5] decorate: allow storing values instead of pointers
  2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
@ 2011-07-11 16:16 ` Jeff King
  2011-07-11 16:39   ` Jeff King
  2011-07-11 19:06   ` Junio C Hamano
  2011-07-11 16:17 ` [PATCH 2/5] add object-cache infrastructure Jeff King
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:16 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

The decorate API provides a mapping of objects to arbitrary
values. Until now, it did this by allowing you to store a
void pointer, which could point to other storage. This has
two problems:

  1. It's inefficient. To store even a small value, you have
     to allocate persistent storage. So we end up storing
     both the value and a pointer to it.

  2. It's a pain to use. To avoid heap overhead for small
     values, you have to either use a custom allocater, or
     you have to shoe-horn the value into the void pointer
     (if it's small enough to fit).

This patch lets you store fixed-size values directly in the
hash table without allocating them elsewhere. This is a
definite win for any value smaller than or equal to a
pointer. It's probably a win for slightly larger values, but
may eventually be slower for storing large structs (because
the values are copied when the hash grows).

It also provides a more natural interface for storing small
values; see the changes in fast-export.c, which can now drop
its pointer arithmetic magic.

The original "store and retrieve a void pointer" API is easy
to implement on top of this (the values we store are the
pointers). The add_decoration and lookup_decoration
functions are kept for compatibility.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/technical/api-decorate.txt |  170 +++++++++++++++++++++++++++++-
 builtin/fast-export.c                    |   32 ++----
 decorate.c                               |  102 ++++++++++++-------
 decorate.h                               |   22 ++++-
 4 files changed, 263 insertions(+), 63 deletions(-)

diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt
index 1d52a6c..8d10b66 100644
--- a/Documentation/technical/api-decorate.txt
+++ b/Documentation/technical/api-decorate.txt
@@ -1,6 +1,172 @@
 decorate API
 ============
 
-Talk about <decorate.h>
+The decorate API is a system for efficiently mapping objects to values
+in memory. It is slightly slower than an actual member of an object
+struct (because it incurs a hash lookup), but it uses less memory when
+the mapping is not in use, or when the number of decorated objects is
+small compared to the total number of objects.
 
-(Linus)
+For efficiency, the mapping is capable of storing actual byte values, as
+long as the byte values for each element are of a fixed size. So one
+could, for example, map objects into 32-bit integers. For ease of use,
+functions are provided for storing the values of arbitrary pointers,
+which can point to strings or structs.
+
+Data Structures
+---------------
+
+`struct decoration`::
+
+	This structure represents a single mapping of objects to
+	values. Its fields are:
+
+	`name`:::
+		This field is not used by the decorate API itself, but
+		may be used by calling code.
+
+	`width`:::
+		This field specifies the width (in units of `char`) of
+		the values to be stored. This field must be set to its
+		final value before any objects are added to the mapping.
+		A width of `0` is equivalent to `sizeof(void *)`.
+
+	`nr`:::
+		The number of objects currently mapped by the
+		decoration.
+
+	`size`:::
+		The number of hash slots allocated; this is kept to at
+		least 3/2 of the number of actual slots used, to keep
+		the hash sparse.
+
+	`hash`:::
+		A pointer to an array of actual `object_decoration`
+		structs. Note that because the width of `struct
+		object_decoration` is not known until runtime, this
+		array is stored with type `unsigned char *`. To access
+		individual items, one must perform pointer arithmetic
+		using the `stride` field (see the description of
+		`stride` below, or the "Examples" section).
+
+	`stride`:::
+		The size of a single object_decoration record.
+		Technically this can be computed as a function of the
+		width, but it is here for ease-of-use.
+
+	`end`:::
+		A pointer to storage just past the end of the `hash`
+		array. This can be used as a boundary for iterating the
+		items in the hash (see "Examples" below). Technically
+		this can be computed as a function of the hash pointer,
+		the size field, and the stride, but is here for
+		ease-of-use.
+
+`struct object_decoration`::
+
+	A structure representing the decoration of a single object.
+	Callers will not normally need to use this object unless they
+	are iterating all elements in the decoration hash. The `base`
+	field points to the object being mapped (or `NULL` if it is
+	an empty hash slot). The `decoration` field stores the mapped
+	value as a sequence of bytes; use the `width` field in `struct
+	decoration` to know the exact size.
+
+
+Functions
+---------
+
+`add_decoration_value`::
+
+	Add a mapping from an object to a sequence of bytes. The number
+	of bytes pointed to by `decoration` should be equal to the
+	`width` field of the `struct decoration`. If the `old` parameter
+	is not NULL and a there was already a value for the object, the
+	bytes of the old value are copied into `old`.  The return value
+	is `1` if there was a previous value, or `0` otherwise. Note
+	that if there is no previous value, then `old` is left
+	untouched; it is the responsibility of the caller to either
+	check the return value or to set a sentinel value in `old`.
+
+`lookup_decoration_value`::
+
+	Retrieve a decoration from the mapping. The return value is a
+	pointer to the sequence of bytes representing the value (of
+	length `width`), or `NULL` if no value is found.
+
+`add_decoration`::
+
+	Add a mapping from an object to a void pointer. If a previous
+	pointer exists for the object, it is returned; otherwise, `NULL`
+	is returned.
+
+`lookup_decoration`::
+
+	Retrieve a void pointer from the mapping. The return value is
+	the stored pointer, or `NULL` if there is no stored pointer.
+
+
+Examples
+--------
+
+Store and retrieve pointers to structs:
+
+-------------------------------------------------------------------
+/* no need to set width parameter; it defaults to sizeof(void *) */
+static struct decoration commit_foos;
+
+void store_foo(const struct commit *c, const char *name)
+{
+	struct foo *value = alloc_foo(name);
+	struct foo *old;
+
+	old = add_decoration(&commit_foos, c->object, value);
+	free(old);
+}
+
+const struct foo *get_foo(const struct commit *c)
+{
+	return lookup_decoration(&commit_foos, c->object);
+}
+-------------------------------------------------------------------
+
+Store and retrieve `unsigned long` integers:
+
+-------------------------------------------------------------------
+static struct decoration longs = { "my longs", sizeof(unsigned long) };
+
+void store_long(const struct object *obj, unsigned long value)
+{
+	unsigned long old;
+	if (add_decoration_value(&longs, obj, &value, &old)
+		printf("old value was %lu\n", old);
+}
+
+void print_long(const struct object *obj)
+{
+	unsigned long *value = lookup_decoration_value(&longs, obj);
+	if (!value)
+		printf("no value\n");
+	else
+		printf("value is %lu\n", *value);
+}
+-------------------------------------------------------------------
+
+Iterate over all stored decorations:
+
+-------------------------------------------------------------------
+void dump_longs(void)
+{
+	unsigned char *p;
+	for (p = longs.hash; p < longs.end; p += longs.stride) {
+		struct object_decoration *e = (struct object_decoration *)p;
+		unsigned long *value = (unsigned long *)e->decoration;
+
+		/* empty hash slot */
+		if (!e->base)
+			continue;
+
+		printf("%s -> %lu\n", sha1_to_hex(e->base->sha1), *value);
+	}
+}
+-------------------------------------------------------------------
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index daf1945..c053f1f 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -59,7 +59,7 @@ static int parse_opt_tag_of_filtered_mode(const struct option *opt,
 	return 0;
 }
 
-static struct decoration idnums;
+static struct decoration idnums = { NULL, sizeof(uint32_t) };
 static uint32_t last_idnum;
 
 static int has_unshown_parent(struct commit *commit)
@@ -73,20 +73,9 @@ static int has_unshown_parent(struct commit *commit)
 	return 0;
 }
 
-/* Since intptr_t is C99, we do not use it here */
-static inline uint32_t *mark_to_ptr(uint32_t mark)
-{
-	return ((uint32_t *)NULL) + mark;
-}
-
-static inline uint32_t ptr_to_mark(void * mark)
-{
-	return (uint32_t *)mark - (uint32_t *)NULL;
-}
-
 static inline void mark_object(struct object *object, uint32_t mark)
 {
-	add_decoration(&idnums, object, mark_to_ptr(mark));
+	add_decoration_value(&idnums, object, &mark, NULL);
 }
 
 static inline void mark_next_object(struct object *object)
@@ -96,10 +85,10 @@ static inline void mark_next_object(struct object *object)
 
 static int get_object_mark(struct object *object)
 {
-	void *decoration = lookup_decoration(&idnums, object);
-	if (!decoration)
+	uint32_t *mark = lookup_decoration_value(&idnums, object);
+	if (!mark)
 		return 0;
-	return ptr_to_mark(decoration);
+	return *mark;
 }
 
 static void show_progress(void)
@@ -536,20 +525,19 @@ static void handle_tags_and_duplicates(struct string_list *extra_refs)
 
 static void export_marks(char *file)
 {
-	unsigned int i;
-	uint32_t mark;
-	struct object_decoration *deco = idnums.hash;
 	FILE *f;
 	int e = 0;
+	unsigned char *p;
 
 	f = fopen(file, "w");
 	if (!f)
 		die_errno("Unable to open marks file %s for writing.", file);
 
-	for (i = 0; i < idnums.size; i++) {
+	for (p = idnums.hash; p < idnums.end; p += idnums.stride) {
+		struct object_decoration *deco = (struct object_decoration *)p;
 		if (deco->base && deco->base->type == 1) {
-			mark = ptr_to_mark(deco->decoration);
-			if (fprintf(f, ":%"PRIu32" %s\n", mark,
+			uint32_t *mark = (uint32_t *)deco->decoration;
+			if (fprintf(f, ":%"PRIu32" %s\n", *mark,
 				sha1_to_hex(deco->base->sha1)) < 0) {
 			    e = 1;
 			    break;
diff --git a/decorate.c b/decorate.c
index 2f8a63e..55ebaee 100644
--- a/decorate.c
+++ b/decorate.c
@@ -6,6 +6,11 @@
 #include "object.h"
 #include "decorate.h"
 
+#define WIDTH(n) ((n)->width ? (n)->width : sizeof(void *))
+#define REF_OFFSET(n, i) ((n)->hash + ((i) * n->stride))
+#define REF_OBJECT(n, obj) (REF_OFFSET(n, hash_obj(obj, n->size)))
+#define OBJ(p) ((struct object_decoration *)(p))
+
 static unsigned int hash_obj(const struct object *obj, unsigned int n)
 {
 	unsigned int hash;
@@ -14,75 +19,100 @@ static unsigned int hash_obj(const struct object *obj, unsigned int n)
 	return hash % n;
 }
 
-static void *insert_decoration(struct decoration *n, const struct object *base, void *decoration)
+static int insert_decoration(struct decoration *n, const struct object *base,
+			     const void *decoration, void *old)
 {
-	int size = n->size;
-	struct object_decoration *hash = n->hash;
-	unsigned int j = hash_obj(base, size);
-
-	while (hash[j].base) {
-		if (hash[j].base == base) {
-			void *old = hash[j].decoration;
-			hash[j].decoration = decoration;
-			return old;
+	unsigned long width = WIDTH(n);
+	unsigned char *p = REF_OBJECT(n, base);
+
+	while (OBJ(p)->base) {
+		struct object_decoration *e = OBJ(p);
+		if (e->base == base) {
+			if (old)
+				memcpy(old, e->decoration, width);
+			memcpy(e->decoration, decoration, width);
+			return 1;
 		}
-		if (++j >= size)
-			j = 0;
+		p += n->stride;
+		if (p >= n->end)
+			p = n->hash;
 	}
-	hash[j].base = base;
-	hash[j].decoration = decoration;
+	OBJ(p)->base = base;
+	memcpy(OBJ(p)->decoration, decoration, width);
 	n->nr++;
-	return NULL;
+	return 0;
 }
 
 static void grow_decoration(struct decoration *n)
 {
-	int i;
-	int old_size = n->size;
-	struct object_decoration *old_hash = n->hash;
+	unsigned char *old_hash = n->hash;
+	unsigned char *old_end = n->end;
+	unsigned char *p;
 
-	n->size = (old_size + 1000) * 3 / 2;
-	n->hash = xcalloc(n->size, sizeof(struct object_decoration));
+	n->stride = sizeof(struct object_decoration) + WIDTH(n);
+	n->size = (n->size + 1000) * 3 / 2;
+	n->hash = xcalloc(n->size, n->stride);
+	n->end = n->hash + (n->size * n->stride);
 	n->nr = 0;
 
-	for (i = 0; i < old_size; i++) {
-		const struct object *base = old_hash[i].base;
-		void *decoration = old_hash[i].decoration;
-
-		if (!base)
+	for (p = old_hash; p < old_end; p += n->stride) {
+		if (!OBJ(p)->base)
 			continue;
-		insert_decoration(n, base, decoration);
+		insert_decoration(n, OBJ(p)->base, OBJ(p)->decoration, NULL);
 	}
 	free(old_hash);
 }
 
 /* Add a decoration pointer, return any old one */
 void *add_decoration(struct decoration *n, const struct object *obj,
-		void *decoration)
+		     void *decoration)
 {
-	int nr = n->nr + 1;
+	void *old = NULL;
 
-	if (nr > n->size * 2 / 3)
+	add_decoration_value(n, obj, &decoration, &old);
+	return old;
+}
+
+int add_decoration_value(struct decoration *n,
+			 const struct object *obj,
+			 const void *decoration,
+			 void *old)
+{
+	if (n->nr + 1 > n->size * 2 / 3)
 		grow_decoration(n);
-	return insert_decoration(n, obj, decoration);
+	return insert_decoration(n, obj, decoration, old);
 }
 
+
 /* Lookup a decoration pointer */
-void *lookup_decoration(struct decoration *n, const struct object *obj)
+void *lookup_decoration(const struct decoration *n, const struct object *obj)
 {
-	unsigned int j;
+	void **v;
+
+	v = lookup_decoration_value(n, obj);
+	if (!v)
+		return NULL;
+	return *v;
+}
+
+void *lookup_decoration_value(const struct decoration *n,
+			      const struct object *obj)
+{
+	const unsigned char *p;
 
 	/* nothing to lookup */
 	if (!n->size)
 		return NULL;
-	j = hash_obj(obj, n->size);
+
+	p = REF_OBJECT(n, obj);
 	for (;;) {
-		struct object_decoration *ref = n->hash + j;
+		struct object_decoration *ref = OBJ(p);
 		if (ref->base == obj)
 			return ref->decoration;
 		if (!ref->base)
 			return NULL;
-		if (++j == n->size)
-			j = 0;
+		p += n->stride;
+		if (p >= n->end)
+			p = n->hash;
 	}
 }
diff --git a/decorate.h b/decorate.h
index e732804..b75054a 100644
--- a/decorate.h
+++ b/decorate.h
@@ -3,16 +3,32 @@
 
 struct object_decoration {
 	const struct object *base;
-	void *decoration;
+	unsigned char decoration[FLEX_ARRAY];
 };
 
 struct decoration {
 	const char *name;
+	/* width of data we're holding; must be set before adding */
+	unsigned int width;
 	unsigned int size, nr;
-	struct object_decoration *hash;
+	/*
+	 * The hash contains object_decoration structs, but we don't know their
+	 * size until runtime. So we store is as a pointer to characters to
+	 * make pointer arithmetic easier.
+	 */
+	unsigned char *hash;
+	unsigned int stride; /* size of a single record */
+	unsigned char *end; /* end of the hash array */
 };
 
 extern void *add_decoration(struct decoration *n, const struct object *obj, void *decoration);
-extern void *lookup_decoration(struct decoration *n, const struct object *obj);
+extern void *lookup_decoration(const struct decoration *n, const struct object *obj);
+
+extern int add_decoration_value(struct decoration *n,
+				const struct object *obj,
+				const void *decoration,
+				void *old);
+extern void *lookup_decoration_value(const struct decoration *n,
+				     const struct object *obj);
 
 #endif
-- 
1.7.6.37.g989c6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/5] add object-cache infrastructure
  2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
  2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
@ 2011-07-11 16:17 ` Jeff King
  2011-07-11 16:46   ` Jeff King
                     ` (2 more replies)
  2011-07-11 16:18 ` [PATCH 3/5] commit: add commit_generation function Jeff King
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:17 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

There is sometimes a need to cache some information about an
object or set of objects persistently across git
invocations. The notes-cache interface can be used for this,
but it is very heavyweight and slow for storing small
values.

This patch introduces a new API, object-cache, which stores
a mapping of objects to values in a concise and efficient
form. See the added API documentation for details.

Signed-off-by: Jeff King <peff@peff.net>
---
As I mentioned earlier, I wanted this to be generic and size-agnostic,
because I'd also like to try caching patch-ids for git-cherry.

 Documentation/technical/api-decorate.txt     |    3 +
 Documentation/technical/api-object-cache.txt |  101 +++++++++++++
 Makefile                                     |    2 +
 object-cache.c                               |  204 ++++++++++++++++++++++++++
 object-cache.h                               |   32 ++++
 5 files changed, 342 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/technical/api-object-cache.txt
 create mode 100644 object-cache.c
 create mode 100644 object-cache.h

diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt
index 8d10b66..c851f59 100644
--- a/Documentation/technical/api-decorate.txt
+++ b/Documentation/technical/api-decorate.txt
@@ -13,6 +13,9 @@ could, for example, map objects into 32-bit integers. For ease of use,
 functions are provided for storing the values of arbitrary pointers,
 which can point to strings or structs.
 
+Note that the decorate API only stores the mapping in memory. See the
+object-cache API for persistent storage.
+
 Data Structures
 ---------------
 
diff --git a/Documentation/technical/api-object-cache.txt b/Documentation/technical/api-object-cache.txt
new file mode 100644
index 0000000..ecead4c
--- /dev/null
+++ b/Documentation/technical/api-object-cache.txt
@@ -0,0 +1,101 @@
+Object Cache API
+================
+
+The object cache API is meant to store a cache of per-object values. The
+aim is for robustness, speed, and simplicity. The stored data must be of
+a fixed size, and the only operations allowed are insertion and
+retrieval with a struct object as the key.
+
+This API is similar to the decorate API, but provides persistence of
+values across multiple invocations. It is also similar to the
+notes-cache API, but is much lighter weight and suitable for storing
+small values. If you are storing large, arbitrary data, consider using
+notes-cache.
+
+
+Storage
+-------
+
+Values are stored both on-disk and in-memory. Newly added values are
+initially stored in a hash table in memory, and written to disk only
+when `object_cache_write` is called.
+
+The disk storage consists of a single file per cache, located in the
+`$GIT_DIR/cache` directory. Each file contains a list of rows ordered by
+sha1. Each row contains a binary sha1, followed by the fixed-size data
+mapped to that sha1.
+
+When the cache is written to disk, the contents of the in-memory data
+and the disk data are merged, with in-memory values taking precedence
+over disk values. The data is written to a temporary file and atomically
+renamed into the new cache file. Thus there is no lock contention
+between competing processes on either reading or writing (though one
+process's updates may be lost).
+
+
+Speed
+-----
+
+Lookup in the cache requires `O(lg(n))` hash comparisons (via binary
+search of the disk contents, or the in-memory hash table).
+
+Insertion into the cache is amortized `O(1)` via the hash table. Writing
+the cache to disk entails `O(n*lg(n) + m)` hash comparisons, where `m`
+is the number of existing disk entries and `n` is the number of newly
+added entries.
+
+
+Data Structures
+---------------
+
+`struct object_cache`::
+
+	This structure represents a single object cache. Its contents
+	should be considered opaque by calling code.
++
+	The `fd`, `disk_entries`, `disk_nr`, and `record_size` fields
+	represent the mmap'd on-disk data. The `mem` field represents
+	the in-memory data (see the decorate API for details).
+
+
+Functions
+---------
+
+`object_cache_init`::
+
+	Initialize a new object cache; this must be called before any
+	other functions are called on the cache. The `name` parameter
+	specifies a human-readable name which will be used for storage
+	in `$GIT_DIR/cache/$name`. The `width` parameter specifies the
+	size, in units of `char`, of the data to be stored (e.g., pass
+	in `sizeof(uint32_t)` to store 32-bit integers).
+
+`object_cache_lookup`::
+
+	Retrieve a value from the object cache. A void pointer to the
+	stored value will be returned, or `NULL` if there is no value.
+
+`object_cache_add`::
+
+	Store a value in the object cache. The value pointer should
+	point to exactly `width` bytes of data.
+
+`object_cache_write`::
+
+	Merge the in-memory and existing disk entries, and write the
+	resulting cache to disk, under the name `$GIT_DIR/cache/$name`.
+	If no in-memory entries exist, the on-disk cache will be left
+	unchanged.
+
+`object_cache_lookup_uint32`::
+
+	Convenience wrapper for retrieving unsigned 32-bit integers. The
+	value will be returned via the `value` pointer. The return value
+	is `0` if a value was found, or negative otherwise (in which
+	case the contents of `value` will be unchanged).
+
+`object_cache_add_uint32`::
+
+	Convenience wrapper for storing unsigned 32-bit integers. Note
+	that integers are stored on disk in network-byte order, so it is
+	safe to access caches from any architecture.
diff --git a/Makefile b/Makefile
index f8c72e1..dd05579 100644
--- a/Makefile
+++ b/Makefile
@@ -535,6 +535,7 @@ LIB_H += merge-recursive.h
 LIB_H += notes.h
 LIB_H += notes-cache.h
 LIB_H += notes-merge.h
+LIB_H += object-cache.h
 LIB_H += object.h
 LIB_H += pack.h
 LIB_H += pack-refs.h
@@ -628,6 +629,7 @@ LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
 LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
+LIB_OBJS += object-cache.o
 LIB_OBJS += object.o
 LIB_OBJS += pack-check.o
 LIB_OBJS += pack-refs.o
diff --git a/object-cache.c b/object-cache.c
new file mode 100644
index 0000000..d019f0a
--- /dev/null
+++ b/object-cache.c
@@ -0,0 +1,204 @@
+#include "cache.h"
+#include "object-cache.h"
+#include "sha1-lookup.h"
+#include "object.h"
+
+static const char *object_cache_path(const char *name)
+{
+	return git_path("cache/%s", name);
+}
+
+static void open_disk_cache(struct object_cache *c, const char *path)
+{
+	struct stat sb;
+
+	c->fd = open(path, O_RDONLY);
+	if (c->fd < 0)
+		return;
+
+	if (fstat(c->fd, &sb) < 0) {
+		close(c->fd);
+		c->fd = -1;
+		return;
+	}
+
+	c->disk_entries = xmmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE,
+				c->fd, 0);
+	c->disk_nr = sb.st_size / c->record_size;
+}
+
+void object_cache_init(struct object_cache *c, const char *name, int width)
+{
+	memset(c, 0, sizeof(*c));
+	c->record_size = 20 + width;
+	c->mem.width = width;
+	open_disk_cache(c, object_cache_path(name));
+}
+
+static void *lookup_disk(const struct object_cache *c,
+			 const struct object *obj)
+{
+	int pos;
+
+	pos = sha1_entry_pos(c->disk_entries, c->record_size, 0,
+			     0, c->disk_nr, c->disk_nr, obj->sha1);
+	if (pos < 0)
+		return NULL;
+
+	return c->disk_entries + (pos * c->record_size) + 20;
+}
+
+const void *object_cache_lookup(const struct object_cache *c,
+					 const struct object *obj)
+{
+	void *r;
+
+	r = lookup_decoration_value(&c->mem, obj);
+	if (!r)
+		r = lookup_disk(c, obj);
+	return r;
+}
+
+void object_cache_add(struct object_cache *c, const struct object *obj,
+		      const void *value)
+{
+	add_decoration_value(&c->mem, obj, value, NULL);
+}
+
+static unsigned char *flatten_mem_entries(struct object_cache *c)
+{
+	unsigned char *p;
+	unsigned char *ret;
+	int nr;
+
+	ret = xmalloc(c->mem.nr * c->record_size);
+	nr = 0;
+	for (p = c->mem.hash; p < c->mem.end; p += c->mem.stride) {
+		struct object_decoration *e = (struct object_decoration *)p;
+		unsigned char *out;
+
+		if (!e->base)
+			continue;
+
+		if (nr == c->mem.nr)
+			die("BUG: decorate hash contained extra values");
+
+		out = ret + (nr * c->record_size);
+		hashcpy(out, e->base->sha1);
+		out += 20;
+		memcpy(out, e->decoration, c->mem.width);
+		nr++;
+	}
+
+	return ret;
+}
+
+static int void_hashcmp(const void *a, const void *b)
+{
+	return hashcmp(a, b);
+}
+
+int object_cache_write(struct object_cache *c, const char *name)
+{
+	const char *path = object_cache_path(name);
+	struct strbuf tempfile = STRBUF_INIT;
+	int fd;
+	unsigned char *mem_entries;
+	int i, j;
+	int r;
+
+	if (!c->mem.nr)
+		return 0;
+
+	strbuf_addf(&tempfile, "%s.XXXXXX", path);
+	if (safe_create_leading_directories(tempfile.buf) < 0 ||
+	    (fd = git_mkstemp_mode(tempfile.buf, 0755)) < 0) {
+		strbuf_release(&tempfile);
+		return -1;
+	}
+
+	mem_entries = flatten_mem_entries(c);
+	qsort(mem_entries, c->mem.nr, c->record_size, void_hashcmp);
+
+#define WRITE_ENTRY(e) \
+do { \
+	if (write_in_full(fd, e, c->record_size) < 0) { \
+		close(fd); \
+		unlink(tempfile.buf); \
+		strbuf_release(&tempfile); \
+		free(mem_entries); \
+		return -1; \
+	} \
+} while (0)
+
+	i = j = 0;
+	while (i < c->mem.nr && j < c->disk_nr) {
+		unsigned char *mem = mem_entries + (i * c->record_size);
+		unsigned char *disk = c->disk_entries + (j * c->record_size);
+		int cmp = hashcmp(mem, disk);
+
+		/* replace disk entries with in-memory ones */
+		if (cmp == 0)
+			j++;
+		else if (cmp < 0) {
+			WRITE_ENTRY(mem);
+			i++;
+		}
+		else {
+			WRITE_ENTRY(disk);
+			j++;
+		}
+	}
+	while (i < c->mem.nr) {
+		unsigned char *mem = mem_entries + (i * c->record_size);
+		WRITE_ENTRY(mem);
+		i++;
+	}
+	while (j < c->disk_nr) {
+		unsigned char *disk = c->disk_entries + (j * c->record_size);
+		WRITE_ENTRY(disk);
+		j++;
+	}
+#undef WRITE_ENTRY
+
+	free(mem_entries);
+
+	if (close(fd) < 0 || rename(tempfile.buf, path) < 0)
+		r = -1;
+	else
+		r = 0;
+
+	unlink(tempfile.buf);
+	strbuf_release(&tempfile);
+	return r;
+}
+
+int object_cache_lookup_uint32(const struct object_cache *c,
+			       const struct object *obj,
+			       uint32_t *value)
+{
+	const uint32_t *out;
+
+	if (c->record_size != 24)
+		die("BUG: size mismatch in object cache lookup (%d != 24)",
+		    c->record_size);
+
+	out = object_cache_lookup(c, obj);
+	if (!out)
+		return -1;
+
+	*value = ntohl(*out);
+	return 0;
+}
+
+void object_cache_add_uint32(struct object_cache *c,
+			     const struct object *obj,
+			     uint32_t value)
+{
+	if (c->record_size != 24)
+		die("BUG: size mismatch in object cache add (%d != 24)",
+		    c->record_size);
+
+	value = htonl(value);
+	object_cache_add(c, obj, &value);
+}
diff --git a/object-cache.h b/object-cache.h
new file mode 100644
index 0000000..b869f1e
--- /dev/null
+++ b/object-cache.h
@@ -0,0 +1,32 @@
+#ifndef OBJECT_CACHE_H
+#define OBJECT_CACHE_H
+
+#include "decorate.h"
+
+struct object_cache {
+	/* mmap'd disk entries */
+	int fd;
+	unsigned char *disk_entries;
+	int disk_nr;
+	int record_size;
+
+	/* in memory entries */
+	struct decoration mem;
+};
+
+void object_cache_init(struct object_cache *, const char *name, int width);
+const void *object_cache_lookup(const struct object_cache *,
+				const struct object *);
+void object_cache_add(struct object_cache *, const struct object *,
+		      const void *value);
+int object_cache_write(struct object_cache *, const char *name);
+
+/* Convenience wrappers around object_cache_{lookup,add} */
+int object_cache_lookup_uint32(const struct object_cache *,
+			       const struct object *,
+			       uint32_t *value);
+void object_cache_add_uint32(struct object_cache *,
+			     const struct object *,
+			     uint32_t value);
+
+#endif /* OBJECT_CACHE_H */
-- 
1.7.6.37.g989c6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/5] commit: add commit_generation function
  2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
  2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
  2011-07-11 16:17 ` [PATCH 2/5] add object-cache infrastructure Jeff King
@ 2011-07-11 16:18 ` Jeff King
  2011-07-11 17:57   ` Clemens Buchacher
  2011-07-11 16:18 ` [PATCH 4/5] pretty: support %G to show the generation number of a commit Jeff King
  2011-07-11 16:18 ` [PATCH 5/5] limit "contains" traversals based on commit generation Jeff King
  4 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:18 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

A commit's generation is its height in the history graph, as
measured from the farthest root. It is defined as:

  - If the commit has no parents, then its generation is 0.

  - Otherwise, its generation is 1 more than the maximum of
    its parents generations.

The following diagram shows a sample history with
generations:

  A(0)--B(1)--C(2)------G(5)--H(6)
         \             /
          D(2)--E(3)--F(4)

Note that C and D have the same generation, as they are both
children of B. Note also that the merge commit G's
generation is 5, not 3, as we take the maximum of its
parents' generations.

Generation numbers can be useful for bounding traversals.
For example, if we have two commits with generations 500 and
600, we know that the second cannot be an ancestor of the
first. The first could be an ancestor of the second, but we
can't know unless we traverse the history graph. However,
when walking backwards from the "600" commit, once we reach
generation "499", we know that the "500" commit cannot be an
ancestor of the "499" commit, and we can stop the traversal
without even looking at the earlier parts of the history.

We already do something similar with commit timestamps in
many traversals. However, timestamps are somewhat
untrustworthy, as we have to deal with clock skew and with
imports from buggy systems.

Generation numbers are easy to calculate recursively, though
you have to go to the roots to do so. This patch calculates
and stores them in a persistent cache.  It uses a simple
recursive algorithm; you could probably drop the recursion
by topologically sorting a list of all commits and filling
in generation numbers left to right. But the recursive
definition coupled with the cache make it very cheap to
calculate generation numbers for new commits at the tip of
history (you only have to traverse back to the last cached
parents).

We could also store generation numbers in the commit header
directly. These would be faster to look at than an external
cache (they would be on par speed-wise with commit
timestamps). But there are a few reasons not to:

  1. The reason to avoid commit timestamps is that they are
     unreliable. Generation numbers would probably be more
     reliable, but they are still redundant with the actual
     graph structure represented by the parent pointers, and
     can therefore be out of sync with the parent
     information. By calculating them ourselves, we know
     they are correct.

  2. With grafts and replacement objects, the graph
     structure (and thus the generation numbers) can be
     changed. So the generation number, while immutable for
     a given commit object, can be changed when we "lie"
     about the graph structure via these mechanisms. Being
     able to simply clear the cache when these things are
     changed is helpful.

  3. There's a lot of existing git history without
     generation headers. So we'd still need to have the same
     cache to handle those cases.

  4. It generally pollutes the header with redundant
     information, which we try to avoid. Putting it in the
     commit header is purely a speedup, and it seems we can
     get similar performance with a generation cache.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 commit.h |    2 ++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/commit.c b/commit.c
index ac337c7..d412e05 100644
--- a/commit.c
+++ b/commit.c
@@ -6,6 +6,7 @@
 #include "diff.h"
 #include "revision.h"
 #include "notes.h"
+#include "object-cache.h"
 
 int save_commit_buffer = 1;
 
@@ -878,3 +879,50 @@ int commit_tree(const char *msg, unsigned char *tree,
 	strbuf_release(&buffer);
 	return result;
 }
+
+static struct object_cache generations;
+
+static unsigned long commit_generation_recurse(struct commit *c)
+{
+	struct commit_list *p;
+	uint32_t r;
+
+	if (!object_cache_lookup_uint32(&generations, &c->object, &r))
+		return r;
+
+	if (parse_commit(c) < 0)
+		die("unable to parse commit: %s", sha1_to_hex(c->object.sha1));
+
+	if (!c->parents)
+		return 0;
+
+	r = 0;
+	for (p = c->parents; p; p = p->next) {
+		unsigned long pgen = commit_generation_recurse(p->item);
+		if (pgen > r)
+			r = pgen;
+	}
+	r++;
+
+	object_cache_add_uint32(&generations, &c->object, r);
+	return r;
+}
+
+static void write_generation_cache(void)
+{
+	object_cache_write(&generations, "generations");
+}
+
+unsigned long commit_generation(const struct commit *commit)
+{
+	static int initialized;
+
+	if (!initialized) {
+		object_cache_init(&generations, "generations", sizeof(uint32_t));
+		atexit(write_generation_cache);
+		initialized = 1;
+	}
+
+	/* drop const because we may call parse_commit */
+	return commit_generation_recurse((struct commit *)commit);
+}
diff --git a/commit.h b/commit.h
index a2d571b..bff6b36 100644
--- a/commit.h
+++ b/commit.h
@@ -176,4 +176,6 @@ extern int commit_tree(const char *msg, unsigned char *tree,
 		struct commit_list *parents, unsigned char *ret,
 		const char *author);
 
+unsigned long commit_generation(const struct commit *commit);
+
 #endif /* COMMIT_H */
-- 
1.7.6.37.g989c6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/5] pretty: support %G to show the generation number of a commit
  2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
                   ` (2 preceding siblings ...)
  2011-07-11 16:18 ` [PATCH 3/5] commit: add commit_generation function Jeff King
@ 2011-07-11 16:18 ` Jeff King
  2011-07-11 16:18 ` [PATCH 5/5] limit "contains" traversals based on commit generation Jeff King
  4 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:18 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

This might be useful for external programs doing topological
sorting or other graph analysis. It's also handy for testing
the generation calculation code.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/pretty-formats.txt |    1 +
 pretty.c                         |    3 +++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/Documentation/pretty-formats.txt b/Documentation/pretty-formats.txt
index 561cc9f..c58ab52 100644
--- a/Documentation/pretty-formats.txt
+++ b/Documentation/pretty-formats.txt
@@ -133,6 +133,7 @@ The placeholders are:
 - '%gD': reflog selector, e.g., `refs/stash@\{1\}`
 - '%gd': shortened reflog selector, e.g., `stash@\{1\}`
 - '%gs': reflog subject
+- '%G': generation number (i.e., distance of path to farthest root ancestor)
 - '%Cred': switch color to red
 - '%Cgreen': switch color to green
 - '%Cblue': switch color to blue
diff --git a/pretty.c b/pretty.c
index f45eb54..8f1b321 100644
--- a/pretty.c
+++ b/pretty.c
@@ -965,6 +965,9 @@ static size_t format_commit_one(struct strbuf *sb, const char *placeholder,
 			return 2;
 		}
 		return 0;	/* unknown %g placeholder */
+	case 'G':
+		strbuf_addf(sb, "%lu", commit_generation(commit));
+		return 1;
 	case 'N':
 		if (c->pretty_ctx->show_notes) {
 			format_display_notes(commit->object.sha1, sb,
-- 
1.7.6.37.g989c6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 5/5] limit "contains" traversals based on commit generation
  2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
                   ` (3 preceding siblings ...)
  2011-07-11 16:18 ` [PATCH 4/5] pretty: support %G to show the generation number of a commit Jeff King
@ 2011-07-11 16:18 ` Jeff King
  4 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:18 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

When looking for commits that contain other commits (e.g.,
via "git tag --contains"), we can end up traversing useless
portions of the graph. For example, if I am looking for a
tag that contains a commit made last week, there is not much
point in traversing portions of the history graph made five
years ago.

This optimization can provide massive speedups. For example,
doing "git tag --contains HEAD~1000" in the linux-2.6
repository goes from:

  real    0m3.139s
  user    0m3.044s
  sys     0m0.092s

to:

  real    0m0.035s
  user    0m0.028s
  sys     0m0.004s

We could use commit timestamps to know when we are going too
far back in history, but they are sometimes not trustworthy.
Extreme clock skew on committers' machines (or bugs in
commit-generating software) mean that we may stop the
traversal too early when seeing commits skewed into the
past.

Instead, we use the calculated commit generation, which is a
propery of the graph itself (but since we cache it, it's
still cheap to consult).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/tag.c |   20 +++++++++++++++++---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index 63bce6e..df6de47 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -40,7 +40,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 }
 
 static int contains_recurse(struct commit *candidate,
-			    const struct commit_list *want)
+			    const struct commit_list *want,
+			    unsigned long cutoff)
 {
 	struct commit_list *p;
 
@@ -57,9 +58,13 @@ static int contains_recurse(struct commit *candidate,
 	if (parse_commit(candidate) < 0)
 		return 0;
 
+	/* stop searching if we go too far back in time */
+	if (commit_generation(candidate) < cutoff)
+		return 0;
+
 	/* Otherwise recurse and mark ourselves for future traversals. */
 	for (p = candidate->parents; p; p = p->next) {
-		if (contains_recurse(p->item, want)) {
+		if (contains_recurse(p->item, want, cutoff)) {
 			candidate->object.flags |= TMP_MARK;
 			return 1;
 		}
@@ -70,7 +75,16 @@ static int contains_recurse(struct commit *candidate,
 
 static int contains(struct commit *candidate, const struct commit_list *want)
 {
-	return contains_recurse(candidate, want);
+	unsigned long cutoff = ULONG_MAX;
+	const struct commit_list *c;
+
+	for (c = want; c; c = c->next) {
+		unsigned long g = commit_generation(c->item);
+		if (g < cutoff)
+			cutoff = g;
+	}
+
+	return contains_recurse(candidate, want, cutoff);
 }
 
 static int show_reference(const char *refname, const unsigned char *sha1,
-- 
1.7.6.37.g989c6

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/5] decorate: allow storing values instead of pointers
  2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
@ 2011-07-11 16:39   ` Jeff King
  2011-07-11 19:06   ` Junio C Hamano
  1 sibling, 0 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:39 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

On Mon, Jul 11, 2011 at 12:16:50PM -0400, Jeff King wrote:

> diff --git a/decorate.h b/decorate.h
> index e732804..b75054a 100644
> --- a/decorate.h
> +++ b/decorate.h
> @@ -3,16 +3,32 @@
>  
>  struct object_decoration {
>  	const struct object *base;
> -	void *decoration;
> +	unsigned char decoration[FLEX_ARRAY];
>  };
>  
>  struct decoration {
>  	const char *name;
> +	/* width of data we're holding; must be set before adding */
> +	unsigned int width;
>  	unsigned int size, nr;
> -	struct object_decoration *hash;
> +	/*
> +	 * The hash contains object_decoration structs, but we don't know their
> +	 * size until runtime. So we store is as a pointer to characters to
> +	 * make pointer arithmetic easier.
> +	 */
> +	unsigned char *hash;
> +	unsigned int stride; /* size of a single record */
> +	unsigned char *end; /* end of the hash array */
>  };

As you can see here, there's a bit of C magic here. We don't know what
we're storing in the object_decoration struct, so it has a variable
size. So there were two subtle things I was concerned about:

  1. Not violating alignment rules when accessing items.

  2. Not violating strict aliasing rules.

I think (1) is OK via the add/lookup interface. It takes void pointers
to objects to store and to retrieve into (and knows the width via the
width field above). And then we memcpy() into and out of the flex-array.
Which means you never access the bytes in the flex array as an "int" or
whatever; you memcpy them into an _actual_ int, and then access them.

I don't think we have to worry about strict aliasing, for two reasons:

  1. Pointers to char (or unsigned char) are assumed to be aliases for
     other things by gcc's strict aliasing rules. Because they are the
     only way to do byte-level pointer arithmetic, which makes them
     useful for stuff like this.

  2. The pointer accesses are hidden across function call boundaries,
     anyway, so I suspect the optimizer has to assume the worst.

What's slightly worrisome to me is the iteration code in fast-export.c
that touches the decoration struct more directly:

> +	for (p = idnums.hash; p < idnums.end; p += idnums.stride) {
> +		struct object_decoration *deco = (struct object_decoration *)p;
>  		if (deco->base && deco->base->type == 1) {
> -			mark = ptr_to_mark(deco->decoration);
> -			if (fprintf(f, ":%"PRIu32" %s\n", mark,
> +			uint32_t *mark = (uint32_t *)deco->decoration;
> +			if (fprintf(f, ":%"PRIu32" %s\n", *mark,
>  				sha1_to_hex(deco->base->sha1)) < 0) {

Here we're directly casting the bytes to a uint32 and accessing them.
I'm not sure if this can create alignment issues. The bytes are in a
struct which is in an array of such packed structs. I suspect in
practice that a uint32_t is not a problem, as the data bytes will always
be on a 4-byte boundary. It might matter for other types, though.

As for strict aliasing, I thought that the "pointer to char can be an
alias" rule would save us. But actually, from my reading, the rule is "a
pointer to char can be an alias to another object" but not necessarily
that "a pointer to another object can be an alias to char". What has me
confused is that gcc complains about type-punning under -O3 (which turns
on -fstrict-aliasing) with this:

  struct object_decoration *deco = (struct object_decoration *)p;
  uint32_t mark = *(uint32_t *)deco->decoration;
  fprintf("%"PRIu32, mark);

but not with the code in the patch:

  struct object_decoration *deco = (struct object_decoration *)p;
  uint32_t *mark = (uint32_t *)deco->decoration;
  fprintf("%"PRIu32, *mark);

so there may be something I'm not understanding.

But I wonder if the safest thing for both alignment and aliasing would
just be:

  struct object_decoration *deco = (struct object_decoration *)p;
  uint32_t mark;
  memcpy(&mark, deco->decoration, sizeof(mark));

Which is probably a little slower, but I'm not sure it's worth
micro-optimizing this loop[1].

-Peff

[1] Actually, if you read the rest of the patch, you will note that what
    used to be assignment to or reading from a void pointer in the
    original decorate.[ch] is now a memcpy, as well. The original
    probably compiled to a single assignment instruction, and now we
    loop via memcpy. That's the price of run-time flexibility. I doubt
    that the extra few instructions are a big deal, considering we by
    definition have just done a lookup in a hash table.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 16:17 ` [PATCH 2/5] add object-cache infrastructure Jeff King
@ 2011-07-11 16:46   ` Jeff King
  2011-07-11 16:58     ` Shawn Pearce
  2011-07-11 19:17   ` Junio C Hamano
  2011-07-12 10:41   ` Jakub Narebski
  2 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-11 16:46 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

On Mon, Jul 11, 2011 at 12:17:54PM -0400, Jeff King wrote:

> +Storage
> +-------
> [...]
> +The disk storage consists of a single file per cache, located in the
> +`$GIT_DIR/cache` directory. Each file contains a list of rows ordered by
> +sha1. Each row contains a binary sha1, followed by the fixed-size data
> +mapped to that sha1.

There's nothing else in the file, not even a header. I should probably
add at least a version byte, in case we ever do want to change what goes
into a cache (in which case we could just blow away what's there and
regenerate).

We could also have a "validity" field that must match for the cache to
be valid. Junio mentioned earlier that you would want to regenerate a
generations cache whenever grafts or replace objects changed. It
wouldn't be hard to do something like:

  validity = sha1(grafts + replace);
  if (validity != cache_validity)
    empty_disk_cache();

And then it would just automatically work, without the user having to
remember to clear the cache.  Calculating the validity would be cheap in
the common case (you have no grafts or replace objects), and not much
more expensive when you do have them (we have to read that data anyway
to respect the replacements).

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 16:46   ` Jeff King
@ 2011-07-11 16:58     ` Shawn Pearce
  0 siblings, 0 replies; 35+ messages in thread
From: Shawn Pearce @ 2011-07-11 16:58 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð, Clemens Buchacher

On Mon, Jul 11, 2011 at 09:46, Jeff King <peff@peff.net> wrote:
> On Mon, Jul 11, 2011 at 12:17:54PM -0400, Jeff King wrote:
>
>> +Storage
>> +-------
>> [...]
>> +The disk storage consists of a single file per cache, located in the
>> +`$GIT_DIR/cache` directory. Each file contains a list of rows ordered by
>> +sha1. Each row contains a binary sha1, followed by the fixed-size data
>> +mapped to that sha1.
>
> There's nothing else in the file, not even a header. I should probably
> add at least a version byte, in case we ever do want to change what goes
> into a cache (in which case we could just blow away what's there and
> regenerate).

Please do. We don't always need to invent a new version... but for a
cache file its probably more common than the PACK network stream, or
the pack-*.idx file. And its painful to shoehorn in a version change
after the fact.

> We could also have a "validity" field that must match for the cache to
> be valid. Junio mentioned earlier that you would want to regenerate a
> generations cache whenever grafts or replace objects changed. It
> wouldn't be hard to do something like:
>
>  validity = sha1(grafts + replace);
>  if (validity != cache_validity)
>    empty_disk_cache();
>
> And then it would just automatically work, without the user having to
> remember to clear the cache.  Calculating the validity would be cheap in
> the common case (you have no grafts or replace objects), and not much
> more expensive when you do have them (we have to read that data anyway
> to respect the replacements).

This is a pretty good idea too. Saves the hapless user from surprises
when they forget the cache exists, but they changed grafts or replace
data.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/5] commit: add commit_generation function
  2011-07-11 16:18 ` [PATCH 3/5] commit: add commit_generation function Jeff King
@ 2011-07-11 17:57   ` Clemens Buchacher
  2011-07-11 21:10     ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Clemens Buchacher @ 2011-07-11 17:57 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Mon, Jul 11, 2011 at 12:18:14PM -0400, Jeff King wrote:
>
> +unsigned long commit_generation(const struct commit *commit)
> +{
[...]
> +	/* drop const because we may call parse_commit */
> +	return commit_generation_recurse((struct commit *)commit);
> +}

Out of curiosity, why make it const in the first place?

Clemens

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/5] decorate: allow storing values instead of pointers
  2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
  2011-07-11 16:39   ` Jeff King
@ 2011-07-11 19:06   ` Junio C Hamano
  2011-07-11 21:20     ` Jeff King
  1 sibling, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2011-07-11 19:06 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

Jeff King <peff@peff.net> writes:

>  static void grow_decoration(struct decoration *n)
>  {
> -	int i;
> -	int old_size = n->size;
> -	struct object_decoration *old_hash = n->hash;
> +	unsigned char *old_hash = n->hash;
> +	unsigned char *old_end = n->end;
> +	unsigned char *p;
>  
> -	n->size = (old_size + 1000) * 3 / 2;
> -	n->hash = xcalloc(n->size, sizeof(struct object_decoration));
> +	n->stride = sizeof(struct object_decoration) + WIDTH(n);

This value should not change once it is initialized, or all h*ll breaks
loose while accessing the old-hash, right?  Just wondering if it makes the
intention clearer if the function had something like this in it:

	if (!old_size) {
		/* initial */
                n->stride = ...
	} else {
		/* rehash to grow */
	}

I am mostly worried about both width and stride being assignable
fields. An alternative may be to expose

	int decoration_stride(struct decoration *n)
        {
        	return sizeof(struct object_decoration) + WIDTH(n);
	}

to the outside callers and drop "stride" field.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 16:17 ` [PATCH 2/5] add object-cache infrastructure Jeff King
  2011-07-11 16:46   ` Jeff King
@ 2011-07-11 19:17   ` Junio C Hamano
  2011-07-11 22:01     ` Jeff King
  2011-07-12 10:41   ` Jakub Narebski
  2 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2011-07-11 19:17 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

Jeff King <peff@peff.net> writes:

> As I mentioned earlier, I wanted this to be generic and size-agnostic,
> because I'd also like to try caching patch-ids for git-cherry.

Sounds like a good thing to aim for, but "Object Cache" sounded too much
like giving access to the object data that is faster than either loose or
packed objects.

This is a completely unrelated tangent but because you brought up
patch-ids ;-), the other day I tried to rebase mm patches on top of
updated linux-next while trying to help Andrew, and noticed that in real
life, many "duplicate" patches you find in updated upstream are "slightly
reworded better" and "rebase --skip" is often the right thing to do, but
it is often very difficult to diagnose, as (1) the patch you are trying to
apply from your old series may be part of a long patch series, and (2) the
commit you are trying to re-apply the said patch to, i.e. the updated
upstream, may already contain many of these semi-duplicate patches. The
conflict resulting from "am -3" in such a situation is not very pleasant
to look at (it looks mostly as if you are reverting the effect of updated
versions of later patches in your series).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/5] commit: add commit_generation function
  2011-07-11 17:57   ` Clemens Buchacher
@ 2011-07-11 21:10     ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 21:10 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: git, Junio C Hamano, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Mon, Jul 11, 2011 at 07:57:09PM +0200, Clemens Buchacher wrote:

> On Mon, Jul 11, 2011 at 12:18:14PM -0400, Jeff King wrote:
> >
> > +unsigned long commit_generation(const struct commit *commit)
> > +{
> [...]
> > +	/* drop const because we may call parse_commit */
> > +	return commit_generation_recurse((struct commit *)commit);
> > +}
> 
> Out of curiosity, why make it const in the first place?

Two reasons:

  1. At the API layer, it's conceptually const. We're not changing the
     commit, but the lazy load of parse_commit is an implementation
     detail.

     In C++, you would stick a "mutable" tag on the parts we lazily load
     via parse_object. Here, we have to cast.

     This isn't C++, of course, and while we do follow some
     object-oriented principles, it's not necessarily worth fighting the
     language like this for the sake of a const. And I would be fine
     with saying "all commit objects should not be marked const, because
     we may lazily parse them, and it's well known that they are not to
     be freed anyway".

     But...

  2. The callsite in pretty.c has a const commit, so we have to cast
     somewhere, and this spot seemed the most appropriate to me (or we
     could drop the consts there, which I would be OK with).

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/5] decorate: allow storing values instead of pointers
  2011-07-11 19:06   ` Junio C Hamano
@ 2011-07-11 21:20     ` Jeff King
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 21:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

On Mon, Jul 11, 2011 at 12:06:08PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >  static void grow_decoration(struct decoration *n)
> >  {
> > -	int i;
> > -	int old_size = n->size;
> > -	struct object_decoration *old_hash = n->hash;
> > +	unsigned char *old_hash = n->hash;
> > +	unsigned char *old_end = n->end;
> > +	unsigned char *p;
> >  
> > -	n->size = (old_size + 1000) * 3 / 2;
> > -	n->hash = xcalloc(n->size, sizeof(struct object_decoration));
> > +	n->stride = sizeof(struct object_decoration) + WIDTH(n);
> 
> This value should not change once it is initialized, or all h*ll breaks
> loose while accessing the old-hash, right?  Just wondering if it makes the
> intention clearer if the function had something like this in it:
> 
> 	if (!old_size) {
> 		/* initial */
>                 n->stride = ...
> 	} else {
> 		/* rehash to grow */
> 	}

Yes, things will break horribly if you ever change the width. I wonder
if we can get some language support with:

  struct decoration {
     ...
     const int width;
     ...
  };

which would require that calling code set the width during
initialization. Doing anything else is more or less insane, although it
does make trouble for the object-cache code, which uses a run-time
function to initialize itself.  But as it's somewhat tightly coupled
with decorations and knows what it's doing, I would not be opposed to
casting away the constness in that one spot with a comment as a
workaround.

> I am mostly worried about both width and stride being assignable
> fields. An alternative may be to expose
> 
> 	int decoration_stride(struct decoration *n)
>         {
>         	return sizeof(struct object_decoration) + WIDTH(n);
> 	}
> 
> to the outside callers and drop "stride" field.

Yeah, I considered that (the "end" field is calculable, too). I just
didn't want the overhead of:

  for (p = d->hash; p < decoration_end(d); p += decoration_stride(d))
     ...

But I suppose we can inline both of those and let the optimizer deal
with hoisting the actual calculations them out of the loop. Or maybe I'm
just being stupid and optimizing too early. Speed is definitely one of
my criteria for this series, though.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 19:17   ` Junio C Hamano
@ 2011-07-11 22:01     ` Jeff King
  2011-07-11 23:21       ` Junio C Hamano
  2011-07-12  6:36       ` Miles Bader
  0 siblings, 2 replies; 35+ messages in thread
From: Jeff King @ 2011-07-11 22:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

On Mon, Jul 11, 2011 at 12:17:56PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > As I mentioned earlier, I wanted this to be generic and size-agnostic,
> > because I'd also like to try caching patch-ids for git-cherry.
> 
> Sounds like a good thing to aim for, but "Object Cache" sounded too much
> like giving access to the object data that is faster than either loose or
> packed objects.

Agreed. I'm open to better suggestions for the name. I would have just
called it a straight out "cache" as it can be used for caching anything,
but that term is already taken in git. :)

You could argue we are caching metadata about objects, so something like
object_metadata_cache might be OK. I dunno.

> This is a completely unrelated tangent but because you brought up
> patch-ids ;-), the other day I tried to rebase mm patches on top of
> updated linux-next while trying to help Andrew, and noticed that in real
> life, many "duplicate" patches you find in updated upstream are "slightly
> reworded better" and "rebase --skip" is often the right thing to do, but
> it is often very difficult to diagnose, as (1) the patch you are trying to
> apply from your old series may be part of a long patch series, and (2) the
> commit you are trying to re-apply the said patch to, i.e. the updated
> upstream, may already contain many of these semi-duplicate patches. The
> conflict resulting from "am -3" in such a situation is not very pleasant
> to look at (it looks mostly as if you are reverting the effect of updated
> versions of later patches in your series).

Yeah, I run into this in git.git because I aggressively rebase my topics
on what you apply upstream. Usually it's very helpful, as I either skip
patches automatically if patch-ids match (which means I throw away my
version in favor of what you marked up based on list comments), or I get
a conflict that looks like:

  <<<<<<< HEAD
  /* what you marked up */
  =======
  /* what I originally had */
  >>>>>>> jk/foobar

and I can easily confirm that what you have is better and resolve in
favor (or if not better, I can go on the list and complain :) ).

But what I sometimes run into, and what I think you are mentioning with
"looks as if you are reverting" is when you have changes that textually
build on one another. For example:

  git init repo &&
  cd repo &&

  # make one base commit
  echo base >base && git add base && git commit -m base &&

  # now make a "topic" of two commits that textually build on one another
  git checkout -b topic &&
  echo one >file && git add file && git commit -m one &&
  echo two >file && git add file && git commit -m two &&

  # now pretend that they got applied upstream, but with a
  # slight tweak only in the first patch so that patch-ids don't match
  git checkout master &&
  echo 'modified one' >file && git add file && git commit -m one &&
  echo 'two' >file && git add file && git commit -m two &&

  # and now rebase the series on top
  git rebase master topic

What you would like to see (and what you would see without the second
commit) is:

  <<<<<<< HEAD
  modified one
  =======
  one
  >>>>>>> one

which quite obviously shows that your patch was marked up, and the
resolution is clear. But with a commit no top, you get:

  <<<<<<< HEAD
  two
  =======
  one
  >>>>>>> one

which looks like you are reverting, because of course you are building
on top of the finished series.

I think the only solution would be to do a better job of heuristically
matching up commits when rebase skips already-in-upstream commits. I
know we discussed it recently and decided that the false positives are
too dangerous for it to just skip based on something like the commit
message. I.e., even though they probably _are_ the same commit, the fact
that the patches don't match is actually really important, and the user
needs to see the conflict.

But we can show them the differences in other ways besides trying to
apply the patch and coming up with a conflict. For example, for any
commit that we think we have already in upstream (by commit message or
whatever heuristic), but whose patch-id is not found, we could show the
interdiff between the possible upstream commit and the rebased commit,
and say "Do you want to skip this?".

Then instead of having to look at the merge conflict of commit one on
top of upstream's commit two, you get to see what commit one would look
like on top of the upstream's commit one. Which is a lot more readable.
So the human is still involved, and still gets to see the conflicting
text, but it's in a much more useful form.

Does that make sense? In this example, I would expect something like:

  $ git rebase master topic
  Applying: one
  Using index info to reconstruct a base tree...
  Falling back to patching base and 3-way merge...
  Auto-merging file
  CONFLICT (content): Merge conflict in file
  Failed to merge in the changes.
  Patch failed at 0001 one

  hint: this commit may already be in upstream as 1234abcd;
  hint: the differences between that commit and this one are:

  diff --git a/file b/file
  --- a/file
  +++ b/file
  @@ -1 +1 @@
  -modified one
  +one

  When you have resolved this problem run "git rebase --continue".
  If you would prefer to skip this patch, instead run "git rebase --skip".
  To restore the original branch and stop rebasing run "git rebase --abort".

And then the user can decide to look at the conflict, or skip based on
the interdiff. For that matter, we can let them run the equivalent of
the interdiff themselves if we simply give them commit sha1s for the
potential upstream and the patch we're applying.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 22:01     ` Jeff King
@ 2011-07-11 23:21       ` Junio C Hamano
  2011-07-11 23:42         ` Junio C Hamano
                           ` (2 more replies)
  2011-07-12  6:36       ` Miles Bader
  1 sibling, 3 replies; 35+ messages in thread
From: Junio C Hamano @ 2011-07-11 23:21 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

Jeff King <peff@peff.net> writes:

> ... But with a commit no top, you get:
>
>   <<<<<<< HEAD
>   two
>   =======
>   one
>   >>>>>>> one
>
> which looks like you are reverting, because of course you are building
> on top of the finished series.

Exactly.

>   $ git rebase master topic
>   Applying: one
>   Using index info to reconstruct a base tree...
>   Falling back to patching base and 3-way merge...
>   Auto-merging file
>   CONFLICT (content): Merge conflict in file
>   Failed to merge in the changes.
>   Patch failed at 0001 one
>
>   hint: this commit may already be in upstream as 1234abcd;
>   hint: the differences between that commit and this one are:
>
>   diff --git a/file b/file
>   --- a/file
>   +++ b/file
>   @@ -1 +1 @@
>   -modified one
>   +one
>
>   When you have resolved this problem run "git rebase --continue".
>   If you would prefer to skip this patch, instead run "git rebase --skip".
>   To restore the original branch and stop rebasing run "git rebase --abort".

Actually I do not think identifying the ones that can safely skipped is
such a big issue. The case I am most concerned about is when you see that
"two reverted back to one" (which you obviously want to avoid, to keep the
effect of the commit the upstream has to have "two" on that line), but at
the same time when you do not agree with the change that the upstream took
for the _current commit_ you are replaying (i.e. you want the final result
to have "one", not "modified one" which the upstream has applied).

The conflict resolution to come up with such an incremental change is very
painful. You have to avoid the "s/two/one/" revert, and you have to keep
the "s/modified one/one" revert, and you need to know which hunks are
conflicting due to what (i.e. the former is because a patch similar to the
one you haven't replayed in this rebase session is already in upstream,
the latter is the upstream tweaked the current patch you are looking at in
a way you do not agree with).

I do not have a good idea to solve this in mind yet.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 23:21       ` Junio C Hamano
@ 2011-07-11 23:42         ` Junio C Hamano
  2011-07-12  0:03         ` Jeff King
  2011-07-12  0:14         ` Illia Bobyr
  2 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2011-07-11 23:42 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

Junio C Hamano <gitster@pobox.com> writes:

> Actually I do not think identifying the ones that can safely skipped is
> such a big issue. The case I am most concerned about is when you see that
> "two reverted back to one" (which you obviously want to avoid, to keep the
> effect of the commit the upstream has to have "two" on that line), but at
> the same time ...

It is even worse if you do not necessarily agree with that "two", which
you might originally have written "t w o". You _still_ want to keep the
version from the upstream (i.e. leaving the result as "two") while
replaying the patch that adds "one" (and correcting the change upstream
did to make it "modified one"), and when it is the "t w o" patch's turn,
you would want to do the same dance to make an incremental correction to
the way the upstream butchered your original change.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 23:21       ` Junio C Hamano
  2011-07-11 23:42         ` Junio C Hamano
@ 2011-07-12  0:03         ` Jeff King
  2011-07-12 19:38           ` Clemens Buchacher
  2011-07-12  0:14         ` Illia Bobyr
  2 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-12  0:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher

On Mon, Jul 11, 2011 at 04:21:54PM -0700, Junio C Hamano wrote:

> Actually I do not think identifying the ones that can safely skipped is
> such a big issue. The case I am most concerned about is when you see that
> "two reverted back to one" (which you obviously want to avoid, to keep the
> effect of the commit the upstream has to have "two" on that line), but at
> the same time when you do not agree with the change that the upstream took
> for the _current commit_ you are replaying (i.e. you want the final result
> to have "one", not "modified one" which the upstream has applied).

I'm not sure there's a general solution to that. You can't keep the
commit you want intact, because you are rebasing and therefore building
on top of the other broken commit. So in a history like:

    B'--C'
   /
  A--B--C

You really want to perform the transformation of B to B', but on top of
C (i.e., "git checkout C; git diff B' B | git apply"). But if B and C
are textually related, it's going to conflict horribly. And I don't
think there is a general solution short of a darcs-style patch algebra.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 23:21       ` Junio C Hamano
  2011-07-11 23:42         ` Junio C Hamano
  2011-07-12  0:03         ` Jeff King
@ 2011-07-12  0:14         ` Illia Bobyr
  2011-07-12  5:35           ` Jeff King
  2 siblings, 1 reply; 35+ messages in thread
From: Illia Bobyr @ 2011-07-12  0:14 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, git@vger.kernel.org, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher

On 7/11/2011 6:21 PM, Junio C Hamano wrote:
> Jeff King<peff@peff.net>  writes:
>
>> ... But with a commit no top, you get:
>>
>>    <<<<<<<  HEAD
>>    two
>>    =======
>>    one
>>    >>>>>>>  one
>>
>> which looks like you are reverting, because of course you are building
>> on top of the finished series.
> Exactly.
>
>>    $ git rebase master topic
>>    Applying: one
>>    Using index info to reconstruct a base tree...
>>    Falling back to patching base and 3-way merge...
>>    Auto-merging file
>>    CONFLICT (content): Merge conflict in file
>>    Failed to merge in the changes.
>>    Patch failed at 0001 one
>>
>>    hint: this commit may already be in upstream as 1234abcd;
>>    hint: the differences between that commit and this one are:
>>
>>    diff --git a/file b/file
>>    --- a/file
>>    +++ b/file
>>    @@ -1 +1 @@
>>    -modified one
>>    +one
>>
>>    When you have resolved this problem run "git rebase --continue".
>>    If you would prefer to skip this patch, instead run "git rebase --skip".
>>    To restore the original branch and stop rebasing run "git rebase --abort".
> Actually I do not think identifying the ones that can safely skipped is
> such a big issue. The case I am most concerned about is when you see that
> "two reverted back to one" (which you obviously want to avoid, to keep the
> effect of the commit the upstream has to have "two" on that line), but at
> the same time when you do not agree with the change that the upstream took
> for the _current commit_ you are replaying (i.e. you want the final result
> to have "one", not "modified one" which the upstream has applied).
>
> The conflict resolution to come up with such an incremental change is very
> painful. You have to avoid the "s/two/one/" revert, and you have to keep
> the "s/modified one/one" revert, and you need to know which hunks are
> conflicting due to what (i.e. the former is because a patch similar to the
> one you haven't replayed in this rebase session is already in upstream,
> the latter is the upstream tweaked the current patch you are looking at in
> a way you do not agree with).
>
> I do not have a good idea to solve this in mind yet.

I am not 100% sure that my solution is exactly about this problem, but 
it seems to be quite relevant.

I think that if you rebase "step-by-step" by doing, for this particular 
example, something like

$ git rebase master^ topic
$ git rebase master topic

You will first see the /modified one/one/ conflict that you will resolve 
your "two" against and then your second rebase will apply with no conflicts.

I have a set of scripts that help me do this kind of rebases by 
essentially rebasing the topic branch against every single commit on the 
upstream.

This way I can clearly see every conflict as at appears and can fix it 
even in cases when a rebase against the final state would give an 
absolutely unbearable diff.

It definitely is much slower but in my case I find this to be the only 
possible solution.  rerere helps enormously here.
At the same time the less changes are in topic...master the faster it 
would be and the more changes are there the more you benefit from a 
gradual rebase.

Here is the actual use case in case I was not very clear above or in 
case someone is interested.

I have a very longed lived branch that I have to rebase against an 
actively developed master.
Besides been longed live the branch contains a lot of absolutely 
unrelated changes.

The branch is a "next version branch" and contains a result of a two 
year work done by two people who though that their version of the 
application will become master "tomorrow", so they where not too shy to 
change things.

The upstream is developed by another 6 - 8 people over 3 years and they 
have no knowledge of the changes done in the "next version branch".

Here is were I step in and am told to get the changes from the next 
version branch onto the master.  And I am not part of either of these 
two groups of people.

In addition to that one of the guys who developed the next version has 
left the company before I joined and the other one is also gone by now.

While it seems pretty crazy to me, thanks to git rebase, git rerere and 
the "step-by-step" rebase I am at my 9th rebase of this topic branch 
through last year and had only minor problems so far.

Essentially now we have 6 people developing the application "in the 
middle" of its history with one person trying to hold the rest of the 
history on top.
I imagine this as a kind of an acrobatic trick worth showing in a circus :)

Ilya Bobyr

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12  0:14         ` Illia Bobyr
@ 2011-07-12  5:35           ` Jeff King
  2011-07-12 21:52             ` Illia Bobyr
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-12  5:35 UTC (permalink / raw)
  To: Illia Bobyr
  Cc: Junio C Hamano, git@vger.kernel.org, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher

On Mon, Jul 11, 2011 at 07:14:16PM -0500, Illia Bobyr wrote:

> I am not 100% sure that my solution is exactly about this problem, but 
> it seems to be quite relevant.
> 
> I think that if you rebase "step-by-step" by doing, for this particular 
> example, something like
> 
> $ git rebase master^ topic
> $ git rebase master topic
> 
> You will first see the /modified one/one/ conflict that you will resolve 
> your "two" against and then your second rebase will apply with no conflicts.
> 
> I have a set of scripts that help me do this kind of rebases by 
> essentially rebasing the topic branch against every single commit on the 
> upstream.

That makes a lot of sense to me as a strategy. Of course, as you
mention, it is horribly slow. And when you do have real conflicts, you
would end up looking at the same conflicts again and again (and as you
mention, rerere can be some help there, though not necessarily
perfect).

> At the same time the less changes are in topic...master the faster it 
> would be and the more changes are there the more you benefit from a 
> gradual rebase.

Yeah, this seems like the real problem to me. It's one thing to rebase
on top of a single series that somebody has applied upstream. But if it
has been 2 weeks, there may be hundreds of commits, and doing hundreds
of rebases is awful. I wonder if you could do better by picking out some
"key" commits in master to rebase on top of using one of:

  1. Divide-and-conquer the commit space. Try the rebase, starting on
     the HEAD. If it works, great. If the user says "this is too hard",
     then find the midpoint between where we tried to rebase and the
     merge base, and try rebasing there. Every time it's too hard, go
     back halfway to the start. Every time it's easy, try the new result
     on top of HEAD.

     So it's basically doing a O(lg n) search backwards for an easy
     place to rebase, and then repeatedly checking if that was a good
     spot (and repeating the backwards search if not). The worst case
     complexity is O(n lg n) rebases. But in practice, you can hopefully
     find the problematic spot in O(lg n), and then everything will just
     work out after 1 or 2 problematic spots.

  2. Use heuristics (like commit message content) to find related
     commits. So if I have a 5-patch series, I can perhaps find the
     likely commits upstream that match my patches, and those are
     good places to try individual rebases. And then I don't care how
     many commits are in master. If I have a 5 patch series, I won't do
     more than 5 rebases.

But I've never tried this in practice. Maybe next time a rebase is ugly
I'll manually work through one of the methods and see how it fares.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 22:01     ` Jeff King
  2011-07-11 23:21       ` Junio C Hamano
@ 2011-07-12  6:36       ` Miles Bader
  1 sibling, 0 replies; 35+ messages in thread
From: Miles Bader @ 2011-07-12  6:36 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher

Jeff King <peff@peff.net> writes:
>> Sounds like a good thing to aim for, but "Object Cache" sounded too much
>> like giving access to the object data that is faster than either loose or
>> packed objects.
>
> Agreed. I'm open to better suggestions for the name. I would have just
> called it a straight out "cache" as it can be used for caching anything,
> but that term is already taken in git. :)

"Info Cache"?  Even more generic, but avoids the loaded term "object"....

-miles

-- 
`Life is a boundless sea of bitterness'

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-11 16:17 ` [PATCH 2/5] add object-cache infrastructure Jeff King
  2011-07-11 16:46   ` Jeff King
  2011-07-11 19:17   ` Junio C Hamano
@ 2011-07-12 10:41   ` Jakub Narebski
  2011-07-12 17:57     ` Jeff King
  2 siblings, 1 reply; 35+ messages in thread
From: Jakub Narebski @ 2011-07-12 10:41 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher,
	Eric Wong

On Mon, 11 July 2011, Jeff King wrote:

> There is sometimes a need to cache some information about an
> object or set of objects persistently across git
> invocations. The notes-cache interface can be used for this,
> but it is very heavyweight and slow for storing small
> values.
> 
> This patch introduces a new API, object-cache, which stores
> a mapping of objects to values in a concise and efficient
> form. See the added API documentation for details.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> As I mentioned earlier, I wanted this to be generic and size-agnostic,
> because I'd also like to try caching patch-ids for git-cherry.

Could this API be generalized to support "reverse cache", for example
to map Subversion revision numbers to Git revision identifiers (for 
git-svn)?

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 10:41   ` Jakub Narebski
@ 2011-07-12 17:57     ` Jeff King
  2011-07-12 18:41       ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-12 17:57 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: git, Junio C Hamano, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher,
	Eric Wong

On Tue, Jul 12, 2011 at 12:41:35PM +0200, Jakub Narebski wrote:

> On Mon, 11 July 2011, Jeff King wrote:
> 
> > There is sometimes a need to cache some information about an
> > object or set of objects persistently across git
> > invocations. The notes-cache interface can be used for this,
> > but it is very heavyweight and slow for storing small
> > values.
> > 
> > This patch introduces a new API, object-cache, which stores
> > a mapping of objects to values in a concise and efficient
> > form. See the added API documentation for details.
> > 
> > Signed-off-by: Jeff King <peff@peff.net>
> > ---
> > As I mentioned earlier, I wanted this to be generic and size-agnostic,
> > because I'd also like to try caching patch-ids for git-cherry.
> 
> Could this API be generalized to support "reverse cache", for example
> to map Subversion revision numbers to Git revision identifiers (for 
> git-svn)?

I hadn't really considered that. It definitely _could_ be done, but I'm
not sure if it's a good idea, for two reasons:

  1. The in-memory store is based on decorate.[ch], which actually
     stores pointers to objects instead of sha1s as keys. Which keeps
     the hash entries a little smaller, and makes comparisons faster.

     We can only take that shortcut because we know what the keys are
     (i.e., objects, and the in-memory store gets pointers and the disk
     store gets sha1s). So I have some fear that a more generalized form
     may be a little slower.

  2. The disk store uses a binary search over a sorted list of sha1s.
     Generalizing this to "a sequence of bytes" would not be hard. But
     we currently have the option of using the uniform distribution of
     sha1 to make better guesses about our "middle" (see the comments in
     sha1-lookup.c). That assumption does not hold over arbitrary bytes.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 17:57     ` Jeff King
@ 2011-07-12 18:41       ` Junio C Hamano
  2011-07-13  6:37         ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2011-07-12 18:41 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narebski, git, Junio C Hamano, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher, Eric Wong

Jeff King <peff@peff.net> writes:

>   2. The disk store uses a binary search over a sorted list of sha1s.
>      Generalizing this to "a sequence of bytes" would not be hard. But
>      we currently have the option of using the uniform distribution of
>      sha1 to make better guesses about our "middle" (see the comments in
>      sha1-lookup.c). That assumption does not hold over arbitrary bytes.

A side note. I notice that the "comment" you refer to appears twice in the
file, and the sha1_pos() function that comes earlier in the file does not
protect itself from overshoot penalty like the sha1_entry_pos() function
does.

Perhaps we should think about unifying them somehow.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12  0:03         ` Jeff King
@ 2011-07-12 19:38           ` Clemens Buchacher
  2011-07-12 19:45             ` Jeff King
  0 siblings, 1 reply; 35+ messages in thread
From: Clemens Buchacher @ 2011-07-12 19:38 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Mon, Jul 11, 2011 at 08:03:04PM -0400, Jeff King wrote:
> On Mon, Jul 11, 2011 at 04:21:54PM -0700, Junio C Hamano wrote:
> 
> > Actually I do not think identifying the ones that can safely skipped is
> > such a big issue. The case I am most concerned about is when you see that
> > "two reverted back to one" (which you obviously want to avoid, to keep the
> > effect of the commit the upstream has to have "two" on that line), but at
> > the same time when you do not agree with the change that the upstream took
> > for the _current commit_ you are replaying (i.e. you want the final result
> > to have "one", not "modified one" which the upstream has applied).
> 
> I'm not sure there's a general solution to that. You can't keep the
> commit you want intact, because you are rebasing and therefore building
> on top of the other broken commit. So in a history like:
> 
>     B'--C'
>    /
>   A--B--C
> 
> You really want to perform the transformation of B to B', but on top of
> C (i.e., "git checkout C; git diff B' B | git apply"). But if B and C
> are textually related, it's going to conflict horribly. And I don't
> think there is a general solution short of a darcs-style patch algebra.

FWIW, I tried this in darcs and it has exactly the same problem.

It does have better granularity when detecting changes. For
example, it will recognize the changes of B' in B, even if B
contains non-conflicting hunks on top of the changes in B'. Git
only recognizes identical commits, and this is something where we
could improve without too much difficulty (think per-hunk
patch-ids).

But if the changes in B and B' have conflicts, then darcs will
present the user with the options B' and C, which looks like we
were trying to revert C, just like it does with git.

Clemens

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 19:38           ` Clemens Buchacher
@ 2011-07-12 19:45             ` Jeff King
  2011-07-12 21:07               ` Clemens Buchacher
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-12 19:45 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Tue, Jul 12, 2011 at 09:38:44PM +0200, Clemens Buchacher wrote:

> > I'm not sure there's a general solution to that. You can't keep the
> > commit you want intact, because you are rebasing and therefore building
> > on top of the other broken commit. So in a history like:
> > 
> >     B'--C'
> >    /
> >   A--B--C
> > 
> > You really want to perform the transformation of B to B', but on top of
> > C (i.e., "git checkout C; git diff B' B | git apply"). But if B and C
> > are textually related, it's going to conflict horribly. And I don't
> > think there is a general solution short of a darcs-style patch algebra.
> 
> FWIW, I tried this in darcs and it has exactly the same problem.

It has been a long time since I've looked at darcs, but from my
recollection, it will only work with specific patch types. That is, it
works if B and C are commutative. For text patches that touch the same
area, that is not the case. But if "B" were a token-renaming patch, for
example, I think it might work.

Anyway, that is not really relevant to git. I think we decided long ago
that being simple and stupid about the content changes (i.e., blob A
became blob B) is better in general, even when there are a few corner
cases that might have been better off the other way.

> It does have better granularity when detecting changes. For
> example, it will recognize the changes of B' in B, even if B
> contains non-conflicting hunks on top of the changes in B'. Git
> only recognizes identical commits, and this is something where we
> could improve without too much difficulty (think per-hunk
> patch-ids).

I'd be curious to see an example worked out. In my experience, even if
something like patch-ids don't match, it's not a big deal for the hunks
that do match, because when we get to the actual content merge, we will
realize that both sides made the same change to that hunk.  So it's not
like you are getting unrelated conflicts; whatever small part of the
diff made the patch-id different will be the part where you get the
conflict, and the should merge cleanly.

Having said something so general, I'm sure there is probably some corner
case that proves me wrong.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 19:45             ` Jeff King
@ 2011-07-12 21:07               ` Clemens Buchacher
  2011-07-12 21:15                 ` Clemens Buchacher
                                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Clemens Buchacher @ 2011-07-12 21:07 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Tue, Jul 12, 2011 at 03:45:40PM -0400, Jeff King wrote:
>
> It has been a long time since I've looked at darcs, but from my
> recollection, it will only work with specific patch types. That is, it
> works if B and C are commutative. For text patches that touch the same
> area, that is not the case. But if "B" were a token-renaming patch, for
> example, I think it might work.

If they were commutative, we would not have a problem in git
either.

> Anyway, that is not really relevant to git. I think we decided long ago
> that being simple and stupid about the content changes (i.e., blob A
> became blob B) is better in general, even when there are a few corner
> cases that might have been better off the other way.

Yes, but that only applies to git merge. When we talk about
rebasing we are looking at individual patches rather than a single
global merge. For rebase I think "patch algebra" is very relevant,
and we have already implemented a simple patch algebra with
patch-id's.

> > It does have better granularity when detecting changes. For
> > example, it will recognize the changes of B' in B, even if B
> > contains non-conflicting hunks on top of the changes in B'. Git
> > only recognizes identical commits, and this is something where we
> > could improve without too much difficulty (think per-hunk
> > patch-ids).
> 
> I'd be curious to see an example worked out. In my experience, even if
> something like patch-ids don't match, it's not a big deal for the hunks
> that do match, because when we get to the actual content merge, we will
> realize that both sides made the same change to that hunk.  So it's not
> like you are getting unrelated conflicts; whatever small part of the
> diff made the patch-id different will be the part where you get the
> conflict, and the should merge cleanly.

I am reading that last part as "they should not merge cleanly". And
in general I agree. We have to resolve the conflict manually and
it's just a question of how the conflict is presented and resolved.
This is already being discussed some in a different branch of this
thread.

> Having said something so general, I'm sure there is probably some corner
> case that proves me wrong.

Exactly. The case I am talking about is where the patch-id's are
different but there are no conflicts. I have worked out an example
for git and darcs. Below are two scripts to demonstrate. In the
example, the patch-id is different because upstream changes the
patch in a way that does not conflict with the original patch. It
simply adds another change that goes into a different hunk. Git
fails to merge cleanly because the patch-id's are different. It
presents the user with an awkward conflict that looks like a
revert. Darcs, on the other hand, merges cleanly. It recognizes the
fact that all changes from the original patch are contained
upstream and do not conflict with the upstream version. The fact
that more changes are added on top does not bother darcs.

Now, one might argue that this is a corner case. But it's actually
very common. In the example, the patch-id changes because of an
extra change in a different text area. That is indeed unlikely.
However, the same problem will occur in a much more common case.
Let's say we have a patch with 10 hunks. The patch is applied
upstream, with only one difference in one of the hunks.
Subsequently, text areas affected by any of the other hunks change
upstream. When the original patch is rebased on top of that, it
will conflict with the one hunk that was changed in the upstream
version of that patch. And that's ok. Git should not decide which
version is correct. But in addition to that conflict there will
also be conflicts for all the other hunks, which the upstream patch
did _not_ modify. And all of those conflicts will look like
reverts.

I believe that is the main reason why rebase is so painful all the
time.

But I am not saying that we necessarily need a finer granularity of
patch-id's. If we rebase the patch on top of its upstream version
before rebasing it to the upstream head, as was already suggested
elsewhere, the problem described here will also go away on its own.

Clemens

---

#!/bin/sh
#
# Darcs recognizes matching upstream changes
#

testdir=test-darcs

mkdir "$testdir" || exit 1

(
	cd "$testdir"
	mkdir master
	cd master
	darcs init

	for line in $(seq 20)
	do
		echo $line >>file
	done

	darcs add file
	darcs record -a -m initial

	cd ..
	darcs get master side
	cd side
	sed -i '5 s/^.*$/original change/' file
	darcs record -a -m 'original change'

	cd ../master
	sed -i '5 s/.*/original change/' file
	sed -i '15 s/^.*$/with an extra hunk/' file
	darcs record -a -m 'original change'

	sed -i '5 s/.*/modified change/' file
	darcs record -a -m 'modified change'

	darcs pull -a ../side
)

#!/bin/sh
#
# Git does not recognize matching upstream changes
#

testdir=test-darcs

mkdir "$testdir" || exit 1

(
	cd "$testdir"
	mkdir master
	cd master
	darcs init

	for line in $(seq 20)
	do
		echo $line >>file
	done

	darcs add file
	darcs record -a -m initial

	cd ..
	darcs get master side
	cd side
	sed -i '5 s/^.*$/original change/' file
	darcs record -a -m 'original change'

	cd ../master
	sed -i '5 s/.*/original change/' file
	sed -i '15 s/^.*$/with an extra hunk/' file
	darcs record -a -m 'original change'

	sed -i '5 s/.*/modified change/' file
	darcs record -a -m 'modified change'

	darcs pull -a ../side
)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 21:07               ` Clemens Buchacher
@ 2011-07-12 21:15                 ` Clemens Buchacher
  2011-07-12 21:36                 ` Jeff King
  2011-07-13  1:33                 ` John Szakmeister
  2 siblings, 0 replies; 35+ messages in thread
From: Clemens Buchacher @ 2011-07-12 21:15 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Tue, Jul 12, 2011 at 11:07:16PM +0200, Clemens Buchacher wrote:
> 
> Exactly. The case I am talking about is where the patch-id's are
> different but there are no conflicts. I have worked out an example
> for git and darcs. Below are two scripts to demonstrate.

Oops, that was the same script twice. Here are the darcs and git
versions, respectively:

#!/bin/sh
#
# Darcs recognizes matching upstream changes
#

testdir=test-darcs

mkdir "$testdir" || exit 1

(
	cd "$testdir"
	mkdir master
	cd master
	darcs init

	for line in $(seq 20)
	do
		echo $line >>file
	done

	darcs add file
	darcs record -a -m initial

	cd ..
	darcs get master side
	cd side
	sed -i '5 s/^.*$/original change/' file
	darcs record -a -m 'original change'

	cd ../master
	sed -i '5 s/.*/original change/' file
	sed -i '15 s/^.*$/with an extra hunk/' file
	darcs record -a -m 'original change'

	sed -i '5 s/.*/modified change/' file
	darcs record -a -m 'modified change'

	darcs pull -a ../side
)


#!/bin/sh
#
# Git does not recognize matching upstream changes
#

testdir=test-git

mkdir "$testdir" || exit 1

(
	cd "$testdir"
	git init -q

	for line in $(seq 20)
	do
		echo $line >>file
	done

	git add file
	git commit -q -m initial

	git checkout -q -b side
	sed -i '5 s/^.*$/original change/' file
	git add file
	git commit -q -m 'original change'

	git checkout -q master
	sed -i '5 s/.*/original change/' file
	sed -i '15 s/^.*$/with an extra hunk/' file
	git add file
	git commit -q -m 'original change'

	sed -i '5 s/.*/modified change/' file
	git add file
	git commit -q -m 'modified change'

	git rebase master side
	git diff
)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 21:07               ` Clemens Buchacher
  2011-07-12 21:15                 ` Clemens Buchacher
@ 2011-07-12 21:36                 ` Jeff King
  2011-07-14  8:04                   ` Clemens Buchacher
  2011-07-13  1:33                 ` John Szakmeister
  2 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-12 21:36 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Tue, Jul 12, 2011 at 11:07:16PM +0200, Clemens Buchacher wrote:

> On Tue, Jul 12, 2011 at 03:45:40PM -0400, Jeff King wrote:
> >
> > It has been a long time since I've looked at darcs, but from my
> > recollection, it will only work with specific patch types. That is, it
> > works if B and C are commutative. For text patches that touch the same
> > area, that is not the case. But if "B" were a token-renaming patch, for
> > example, I think it might work.
> 
> If they were commutative, we would not have a problem in git
> either.

Except that git doesn't support many commutative special forms, like
token-renaming patches.

> > Anyway, that is not really relevant to git. I think we decided long ago
> > that being simple and stupid about the content changes (i.e., blob A
> > became blob B) is better in general, even when there are a few corner
> > cases that might have been better off the other way.
> 
> Yes, but that only applies to git merge. When we talk about
> rebasing we are looking at individual patches rather than a single
> global merge. For rebase I think "patch algebra" is very relevant,
> and we have already implemented a simple patch algebra with
> patch-id's.

I can buy that argument, but I think most of the benefit in darcs comes
from annotating your patch as "this is just renaming 'foo' to 'bar'" and
other special patch types. Because the algebraic properties of those
types is more interesting. And what we really don't want in git is
having to put the burden on users of making those annotations (not just
because it's annoying to do, but because when the annotation doesn't
match what's in the blobs, the results would be extremely confusing).

But if you are proposing that we could do run-time detection on those
sorts of patch properties and use the result in making a better rebase,
I don't think that's a bad idea (though I do wonder if the amount of
code will be worth it).

Again, it has been a long time since I've looked at darcs, and I was
never a serious user of it, so everything I say above may be utterly
wrong.

> > I'd be curious to see an example worked out. In my experience, even if
> > something like patch-ids don't match, it's not a big deal for the hunks
> > that do match, because when we get to the actual content merge, we will
> > realize that both sides made the same change to that hunk.  So it's not
> > like you are getting unrelated conflicts; whatever small part of the
> > diff made the patch-id different will be the part where you get the
> > conflict, and the should merge cleanly.
> 
> I am reading that last part as "they should not merge cleanly".

Sorry, it was supposed to be "...and the rest should merge cleanly". But
I think you got my meaning.

> Exactly. The case I am talking about is where the patch-id's are
> different but there are no conflicts. I have worked out an example
> for git and darcs. Below are two scripts to demonstrate. In the
> example, the patch-id is different because upstream changes the
> patch in a way that does not conflict with the original patch. It
> simply adds another change that goes into a different hunk. Git
> fails to merge cleanly because the patch-id's are different. It
> presents the user with an awkward conflict that looks like a
> revert. Darcs, on the other hand, merges cleanly. It recognizes the
> fact that all changes from the original patch are contained
> upstream and do not conflict with the upstream version. The fact
> that more changes are added on top does not bother darcs.

Ah, OK, I see. Let me try to amend what I said earlier to make sure.

In the normal case of applying patch B on top of patch A, it doesn't
matter if we use per-hunk patch-ids or normal patch-ids. Because even if
we decide to actually go through with the merge of B on top of A, any
hunks that _would have_ had their per-hunk patch-ids match will merge
cleanly.

But in the real world, it is about applying patch Z on top of patches
A..Y, where Z has similar hunks to patch N. And then it _does_ make a
difference, because it is about skipping hunks from Z that are already
in N, but will end up applied on top of Y. And what's in Y and what's in
N may be quite different.

Does that sound right?

> Now, one might argue that this is a corner case. But it's actually
> very common. In the example, the patch-id changes because of an
> extra change in a different text area. That is indeed unlikely.
> However, the same problem will occur in a much more common case.
> Let's say we have a patch with 10 hunks. The patch is applied
> upstream, with only one difference in one of the hunks.
> Subsequently, text areas affected by any of the other hunks change
> upstream. When the original patch is rebased on top of that, it
> will conflict with the one hunk that was changed in the upstream
> version of that patch. And that's ok. Git should not decide which
> version is correct. But in addition to that conflict there will
> also be conflicts for all the other hunks, which the upstream patch
> did _not_ modify. And all of those conflicts will look like
> reverts.

Right, that makes sense to me now.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12  5:35           ` Jeff King
@ 2011-07-12 21:52             ` Illia Bobyr
  0 siblings, 0 replies; 35+ messages in thread
From: Illia Bobyr @ 2011-07-12 21:52 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git@vger.kernel.org, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher

On 7/12/2011 12:35 AM, Jeff King wrote:
> On Mon, Jul 11, 2011 at 07:14:16PM -0500, Illia Bobyr wrote:
>
>> I am not 100% sure that my solution is exactly about this problem, but
>> it seems to be quite relevant.
>>
>> I think that if you rebase "step-by-step" by doing, for this particular
>> example, something like
>>
>> $ git rebase master^ topic
>> $ git rebase master topic
>>
>> You will first see the /modified one/one/ conflict that you will resolve
>> your "two" against and then your second rebase will apply with no conflicts.
>>
>> I have a set of scripts that help me do this kind of rebases by
>> essentially rebasing the topic branch against every single commit on the
>> upstream.
> That makes a lot of sense to me as a strategy. Of course, as you
> mention, it is horribly slow. And when you do have real conflicts, you
> would end up looking at the same conflicts again and again (and as you
> mention, rerere can be some help there, though not necessarily
> perfect).

Well, I would like to comment on "horribly slow" a little bit :)
It is slower than a normal rebase, but it is much faster than, at least 
in my case, me doing the conflict resolution on the final versions 
without intermediate steps.

Also, in my case, it is faster than compiling every single commit of the 
topic branch after the rebase.  Something that I just have to do, again, 
because if I do this only with the final version it gives me so much 
errors all over the code, that I can hardly do anything with it.  
Besides I would like the commits that were "incorrectly" merged and that 
introduced the compilation error to contain the fixes, not have all the 
fixes as a final "merge" commit.

In other words, while been slow, it is just one step in a general 
process that, in my case, have steps that are slower.

Also, my guess is that, as rebase is an sh script it may be made faster 
and a step-by-step rebase will became considerably faster as well, if it 
would be rewritten in C.  Though I have not looked a lot inside the 
script, so I might be wrong.

I would like to note that in case of hundreds of rebases with dozens of 
conflicts rerere's help is hard to overestimate.  I have rebased my 
topic branch through at least 500 commits for the last 6 month and as I 
actually do not copy any changes into the master branch, some conflicts 
stay we me for the whole time.

>> At the same time the less changes are in topic...master the faster it
>> would be and the more changes are there the more you benefit from a
>> gradual rebase.
> Yeah, this seems like the real problem to me. It's one thing to rebase
> on top of a single series that somebody has applied upstream. But if it
> has been 2 weeks, there may be hundreds of commits, and doing hundreds
> of rebases is awful. I wonder if you could do better by picking out some
> "key" commits in master to rebase on top of using one of:
>
>    1. Divide-and-conquer the commit space. Try the rebase, starting on
>       the HEAD. If it works, great. If the user says "this is too hard",
>       then find the midpoint between where we tried to rebase and the
>       merge base, and try rebasing there. Every time it's too hard, go
>       back halfway to the start. Every time it's easy, try the new result
>       on top of HEAD.
>
>       So it's basically doing a O(lg n) search backwards for an easy
>       place to rebase, and then repeatedly checking if that was a good
>       spot (and repeating the backwards search if not). The worst case
>       complexity is O(n lg n) rebases. But in practice, you can hopefully
>       find the problematic spot in O(lg n), and then everything will just
>       work out after 1 or 2 problematic spots.

I have problems all over the upstream history :)
And the issue is not in exactly in finding a problem spot.  It is about 
giving the user (that is me in this case) something that he can merge in 
a reasonable amount of time.

I have a topic branch with 174 commits.  It takes my machine about 7 
minutes to rebase it.
If I have a conflict that I do not understand, it may take me, on 
average, two hours to figure out what a conflict is about and fix it.  
Sometimes it may take 4 hours or even more if I have to involve other 
developers.
If my machine will have to do 20 rebases or even 40 and it will present 
me with simple conflicts that I can solve in seconds it would still be 
better than if I will spend hours trying to figure out something and 
give up by saying "this is too hard".  Note that while it rebases I am 
working on something else.

In my case sometimes even the most basic conflicts that arise because of 
a rebase against just one commit may be hard to merge.
If I can avoid even one of these I will be happy to let one of my 
machine cores run for hours :)

Obviously, my case is a wired case caused by not-the-best development 
practices.  But, I guess, one can still consider it as one of the points 
on the curve that approximate this kind of use cases.  A pretty extreme 
point.

>    2. Use heuristics (like commit message content) to find related
>       commits. So if I have a 5-patch series, I can perhaps find the
>       likely commits upstream that match my patches, and those are
>       good places to try individual rebases. And then I don't care how
>       many commits are in master. If I have a 5 patch series, I won't do
>       more than 5 rebases.
>
> But I've never tried this in practice. Maybe next time a rebase is ugly
> I'll manually work through one of the methods and see how it fares.

I guess that I view this problem from a little different angle, as in my 
case, it is not be exactly my own patches that are causing problems, but 
an upstream changes that have other changes base on them.

Now I am guessing, but here is another idea.

I think that one can check the modification history of the lines in the 
master commits we are rebasing against and in all the topic commits, 
similar to what blame does.
Essentially take a set of all lines that were modified by the topic 
branch (along with the context lines) and sets of lines modified by ever 
single commit (without the context lines).
If a commit does not touch lines from the topic branch set it will not 
cause conflicts and we can skip it.  It will just cause offsets in the 
line numbers when the patches will be applied.

If you are rebasing against a lot of changes that are unrelated to your 
topic branch and if they are split into commits correctly, this way, I 
guess, it would be possible to find those that may cause conflicts.
The actual rebase may still be able to go through some of them without 
conflicts but I see this as a first approximation that might save time.  
And it seems to give only false negatives.

Kind of a conflict prediction approximation.

Ilya

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 21:07               ` Clemens Buchacher
  2011-07-12 21:15                 ` Clemens Buchacher
  2011-07-12 21:36                 ` Jeff King
@ 2011-07-13  1:33                 ` John Szakmeister
  2 siblings, 0 replies; 35+ messages in thread
From: John Szakmeister @ 2011-07-13  1:33 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Jeff King, Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð

On Tue, Jul 12, 2011 at 5:07 PM, Clemens Buchacher <drizzd@aon.at> wrote:
[snip]
> Now, one might argue that this is a corner case. But it's actually
> very common. In the example, the patch-id changes because of an
> extra change in a different text area. That is indeed unlikely.
> However, the same problem will occur in a much more common case.
> Let's say we have a patch with 10 hunks. The patch is applied
> upstream, with only one difference in one of the hunks.
> Subsequently, text areas affected by any of the other hunks change
> upstream. When the original patch is rebased on top of that, it
> will conflict with the one hunk that was changed in the upstream
> version of that patch. And that's ok. Git should not decide which
> version is correct. But in addition to that conflict there will
> also be conflicts for all the other hunks, which the upstream patch
> did _not_ modify. And all of those conflicts will look like
> reverts.
>
> I believe that is the main reason why rebase is so painful all the
> time.

Clemens, that's a great description of the problem.  I've run into
this several times, and it is really confusing.  I've spent
considerable time tracking down the real conflict... only to find the
real issue was in something non-related and easily resolved.  IMHO, I
agree with you Clemens: this has been my major source of pain.

-John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 18:41       ` Junio C Hamano
@ 2011-07-13  6:37         ` Jeff King
  2011-07-13 17:49           ` Junio C Hamano
  0 siblings, 1 reply; 35+ messages in thread
From: Jeff King @ 2011-07-13  6:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jakub Narebski, git, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher,
	Eric Wong

On Tue, Jul 12, 2011 at 11:41:27AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   2. The disk store uses a binary search over a sorted list of sha1s.
> >      Generalizing this to "a sequence of bytes" would not be hard. But
> >      we currently have the option of using the uniform distribution of
> >      sha1 to make better guesses about our "middle" (see the comments in
> >      sha1-lookup.c). That assumption does not hold over arbitrary bytes.
> 
> A side note. I notice that the "comment" you refer to appears twice in the
> file, and the sha1_pos() function that comes earlier in the file does not
> protect itself from overshoot penalty like the sha1_entry_pos() function
> does.
> 
> Perhaps we should think about unifying them somehow.

It would be easy to implement sha1_entry_pos in terms of sha1_pos by
writing an access function. But it seems unnecessarily slow to add
function call overhead in what should be a fairly tight loop.

OTOH, we do it everywhere else that we call sha1_pos; either it isn't a
big deal, or nobody has bothered to measure and micro-optimize there
yet.

-Peff

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-13  6:37         ` Jeff King
@ 2011-07-13 17:49           ` Junio C Hamano
  0 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2011-07-13 17:49 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narebski, git, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher,
	Eric Wong

Jeff King <peff@peff.net> writes:

> It would be easy to implement sha1_entry_pos in terms of sha1_pos by
> writing an access function. But it seems unnecessarily slow to add
> function call overhead in what should be a fairly tight loop.

Yeah, that could also be a double-regression as sha1_pos() implementation
is sloppier and does not protect itself from its guess overshooting the
target like sha1_entry_pos() does. The first step definitely is to remove
the duplicated comment, and the second step would probably be to unify the
"mi" selection logic in sha1_pos() and sha1_entry_pos().

Either the simplicity of the former is sufficient for the users of the
latter, or the users of the former would also benefit from the less
agressive selection of the latter.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-12 21:36                 ` Jeff King
@ 2011-07-14  8:04                   ` Clemens Buchacher
  2011-07-14 16:26                     ` Illia Bobyr
  0 siblings, 1 reply; 35+ messages in thread
From: Clemens Buchacher @ 2011-07-14  8:04 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, git, Jakub Narebski, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason

On Tue, Jul 12, 2011 at 05:36:07PM -0400, Jeff King wrote:
> 
> In the normal case of applying patch B on top of patch A, it doesn't
> matter if we use per-hunk patch-ids or normal patch-ids. Because even if
> we decide to actually go through with the merge of B on top of A, any
> hunks that _would have_ had their per-hunk patch-ids match will merge
> cleanly.
> 
> But in the real world, it is about applying patch Z on top of patches
> A..Y, where Z has similar hunks to patch N. And then it _does_ make a
> difference, because it is about skipping hunks from Z that are already
> in N, but will end up applied on top of Y. And what's in Y and what's in
> N may be quite different.
> 
> Does that sound right?

Yes, exactly.

And one possible solution would be to drop all hunks from Z which
are already somewhere in A..Y. But that undermines the whole
changeset idea.

If we detect the similarities between Z and N, then we could rebase
Z to N, make the user resolve any conflicts, which should make more
sense than what we would have between Z and Y. Then we have Z' on
top of N:

  Z      Z'     Z"
 /      /      /
A--..--N--..--Y

Subsequently we rebase Z' to Y, at which point only changes remain
that we disagreed with. For those we may have to do conflict
resolution again. So, in some cases this approach could result in
more work.

Clemens

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/5] add object-cache infrastructure
  2011-07-14  8:04                   ` Clemens Buchacher
@ 2011-07-14 16:26                     ` Illia Bobyr
  0 siblings, 0 replies; 35+ messages in thread
From: Illia Bobyr @ 2011-07-14 16:26 UTC (permalink / raw)
  To: Clemens Buchacher
  Cc: Jeff King, Junio C Hamano, git@vger.kernel.org, Jakub Narebski,
	Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason

On 7/14/2011 3:04 AM, Clemens Buchacher wrote:
> On Tue, Jul 12, 2011 at 05:36:07PM -0400, Jeff King wrote:
>> In the normal case of applying patch B on top of patch A, it doesn't
>> matter if we use per-hunk patch-ids or normal patch-ids. Because even if
>> we decide to actually go through with the merge of B on top of A, any
>> hunks that _would have_ had their per-hunk patch-ids match will merge
>> cleanly.
>>
>> But in the real world, it is about applying patch Z on top of patches
>> A..Y, where Z has similar hunks to patch N. And then it _does_ make a
>> difference, because it is about skipping hunks from Z that are already
>> in N, but will end up applied on top of Y. And what's in Y and what's in
>> N may be quite different.
>>
>> Does that sound right?
> Yes, exactly.
>
> And one possible solution would be to drop all hunks from Z which
> are already somewhere in A..Y. But that undermines the whole
> changeset idea.
>
> If we detect the similarities between Z and N, then we could rebase
> Z to N, make the user resolve any conflicts, which should make more
> sense than what we would have between Z and Y. Then we have Z' on
> top of N:
>
>    Z      Z'     Z"
>   /      /      /
> A--..--N--..--Y
>
> Subsequently we rebase Z' to Y, at which point only changes remain
> that we disagreed with. For those we may have to do conflict
> resolution again. So, in some cases this approach could result in
> more work.

This is where rerere helps, AFAIU.
And if the conflict is non-trivial there is a chance that it is really 
something you would like to take a look at.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-07-14 16:26 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-11 16:13 [RFC/PATCH 0/5] generation numbers for faster traversals Jeff King
2011-07-11 16:16 ` [PATCH 1/5] decorate: allow storing values instead of pointers Jeff King
2011-07-11 16:39   ` Jeff King
2011-07-11 19:06   ` Junio C Hamano
2011-07-11 21:20     ` Jeff King
2011-07-11 16:17 ` [PATCH 2/5] add object-cache infrastructure Jeff King
2011-07-11 16:46   ` Jeff King
2011-07-11 16:58     ` Shawn Pearce
2011-07-11 19:17   ` Junio C Hamano
2011-07-11 22:01     ` Jeff King
2011-07-11 23:21       ` Junio C Hamano
2011-07-11 23:42         ` Junio C Hamano
2011-07-12  0:03         ` Jeff King
2011-07-12 19:38           ` Clemens Buchacher
2011-07-12 19:45             ` Jeff King
2011-07-12 21:07               ` Clemens Buchacher
2011-07-12 21:15                 ` Clemens Buchacher
2011-07-12 21:36                 ` Jeff King
2011-07-14  8:04                   ` Clemens Buchacher
2011-07-14 16:26                     ` Illia Bobyr
2011-07-13  1:33                 ` John Szakmeister
2011-07-12  0:14         ` Illia Bobyr
2011-07-12  5:35           ` Jeff King
2011-07-12 21:52             ` Illia Bobyr
2011-07-12  6:36       ` Miles Bader
2011-07-12 10:41   ` Jakub Narebski
2011-07-12 17:57     ` Jeff King
2011-07-12 18:41       ` Junio C Hamano
2011-07-13  6:37         ` Jeff King
2011-07-13 17:49           ` Junio C Hamano
2011-07-11 16:18 ` [PATCH 3/5] commit: add commit_generation function Jeff King
2011-07-11 17:57   ` Clemens Buchacher
2011-07-11 21:10     ` Jeff King
2011-07-11 16:18 ` [PATCH 4/5] pretty: support %G to show the generation number of a commit Jeff King
2011-07-11 16:18 ` [PATCH 5/5] limit "contains" traversals based on commit generation Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).