git@vger.kernel.org mailing list mirror (one of many)
* [PATCH v2 00/19] Index-v5
@ 2013-07-12 17:26 Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 01/19] t2104: Don't fail for index versions other than [23] Thomas Gummerer
                   ` (20 more replies)
  0 siblings, 21 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Hi,

Previous rounds (without the api) are at $gmane/202752, $gmane/202923,
$gmane/203088 and $gmane/203517; the previous round with the api was at
$gmane/229732.  Thanks to Junio, Duy and Eric for their comments on
the previous round.

Changes since the previous round:

Add documentation for the index reading api (was sent later in the
previous series as 5.5/22).

read-cache: add index reading api

  - Remove the next_ce pointer and use index_state->cache[] as part
    of the api, which makes a few functions (next_index_entry,
    get_index_entry_by_name, sort_index) unnecessary; those were
    removed.

  - Replace const char **pathspec with const struct pathspec

  - Remove the index_change_filter_opts function.  For better
    performance the caller should decide, before reading anything,
    which part of the index file should be read.  Changing the
    filter_opts later makes us read part of the index again, which
    is sub-optimal.

  - Set the filter_opts to NULL when discarding the cache.

  - Discussed but left unchanged: replacing the pathspec with
    tree_entry_interesting doesn't work well.  Using common_prefix()
    to calculate the adjusted_pathspec would work, but has some
    disadvantages when a glob is used in the pathspec.

make sure partially read index is not changed

  - Instead of using the index_change_filter_opts function to set the
    filter_opts to NULL before changing the index, die() when the
    caller tries to change a partially read index.

  - Use filter_opts to check whether the index was partially read,
    instead of an extra flag.

  - die() when trying to re-read a partially read index.

Patches 6, 7, 8, 9 and 12/22 were removed, as they are no longer
needed when keeping index_state->cache[] as part of the api.

ls-files: use index api

  - Only use the new api for loading the filtered index, but don't use
    the for_each_index_entry function, as it would need to change the
    filter_opts after reading the index.  That limitation will be
    lifted by a series that Duy has cooking [1], after which the loops
    can be switched over to for_each_index_entry.

Other than that, the typos found by Eric were fixed.

[1] http://thread.gmane.org/gmane.comp.version-control.git/229732/focus=230023

Thomas Gummerer (18):
  t2104: Don't fail for index versions other than [23]
  read-cache: split index file version specific functionality
  read-cache: move index v2 specific functions to their own file
  read-cache: Re-read index if index file changed
  Add documentation for the index api
  read-cache: add index reading api
  make sure partially read index is not changed
  grep.c: Use index api
  ls-files.c: use index api
  documentation: add documentation of the index-v5 file format
  read-cache: make in-memory format aware of stat_crc
  read-cache: read index-v5
  read-cache: read resolve-undo data
  read-cache: read cache-tree in index-v5
  read-cache: write index-v5
  read-cache: write index-v5 cache-tree data
  read-cache: write resolve-undo data for index-v5
  update-index.c: rewrite index when index-version is given

Thomas Rast (1):
  p0003-index.sh: add perf test for the index formats

 Documentation/technical/api-in-core-index.txt    |   54 +-
 Documentation/technical/index-file-format-v5.txt |  296 +++++
 Makefile                                         |    3 +
 builtin/grep.c                                   |   71 +-
 builtin/ls-files.c                               |   31 +-
 builtin/update-index.c                           |    8 +-
 cache-tree.c                                     |    2 +-
 cache-tree.h                                     |    6 +
 cache.h                                          |  140 +-
 read-cache-v2.c                                  |  589 +++++++++
 read-cache-v5.c                                  | 1516 ++++++++++++++++++++++
 read-cache.c                                     |  676 +++-------
 read-cache.h                                     |   65 +
 t/perf/p0003-index.sh                            |   59 +
 t/t2104-update-index-skip-worktree.sh            |    1 +
 test-index-version.c                             |    7 +-
 16 files changed, 2942 insertions(+), 582 deletions(-)
 create mode 100644 Documentation/technical/index-file-format-v5.txt
 create mode 100644 read-cache-v2.c
 create mode 100644 read-cache-v5.c
 create mode 100644 read-cache.h
 create mode 100755 t/perf/p0003-index.sh

-- 
1.8.3.453.g1dfc63d


* [PATCH v2 01/19] t2104: Don't fail for index versions other than [23]
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 02/19] read-cache: split index file version specific functionality Thomas Gummerer
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

t2104 currently checks for the exact index version 2 or 3,
depending on whether there is a skip-worktree flag or not. Other
index versions do not use extended flags and thus cannot
be tested for version changes.

Make this test update the index to version 2 at the beginning
of the test. Testing the skip-worktree flags for the default
index format is still covered by t7011 and t7012.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 t/t2104-update-index-skip-worktree.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/t2104-update-index-skip-worktree.sh b/t/t2104-update-index-skip-worktree.sh
index 1d0879b..bd9644f 100755
--- a/t/t2104-update-index-skip-worktree.sh
+++ b/t/t2104-update-index-skip-worktree.sh
@@ -22,6 +22,7 @@ H sub/2
 EOF
 
 test_expect_success 'setup' '
+	git update-index --index-version=2 &&
 	mkdir sub &&
 	touch ./1 ./2 sub/1 sub/2 &&
 	git add 1 2 sub/1 sub/2 &&
-- 
1.8.3.453.g1dfc63d


* [PATCH v2 02/19] read-cache: split index file version specific functionality
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 01/19] t2104: Don't fail for index versions other than [23] Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file Thomas Gummerer
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Split the index file version specific functionality into its own
functions, to prepare for moving the version specific parts to their
own file.  This makes it easier to add a new index file format later.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h              |   5 +-
 read-cache.c         | 130 +++++++++++++++++++++++++++++++++------------------
 test-index-version.c |   2 +-
 3 files changed, 90 insertions(+), 47 deletions(-)

diff --git a/cache.h b/cache.h
index c288678..7af853b 100644
--- a/cache.h
+++ b/cache.h
@@ -100,9 +100,12 @@ unsigned long git_deflate_bound(git_zstream *, unsigned long);
  */
 
 #define CACHE_SIGNATURE 0x44495243	/* "DIRC" */
-struct cache_header {
+struct cache_version_header {
 	unsigned int hdr_signature;
 	unsigned int hdr_version;
+};
+
+struct cache_header {
 	unsigned int hdr_entries;
 };
 
diff --git a/read-cache.c b/read-cache.c
index d5201f9..93947bf 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1268,10 +1268,8 @@ struct ondisk_cache_entry_extended {
 			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
 			    ondisk_cache_entry_size(ce_namelen(ce)))
 
-static int verify_hdr(struct cache_header *hdr, unsigned long size)
+static int verify_hdr_version(struct cache_version_header *hdr, unsigned long size)
 {
-	git_SHA_CTX c;
-	unsigned char sha1[20];
 	int hdr_version;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
@@ -1279,10 +1277,22 @@ static int verify_hdr(struct cache_header *hdr, unsigned long size)
 	hdr_version = ntohl(hdr->hdr_version);
 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
 		return error("bad index version %d", hdr_version);
+	return 0;
+}
+
+static int verify_hdr(void *mmap, unsigned long size)
+{
+	git_SHA_CTX c;
+	unsigned char sha1[20];
+
+	if (size < sizeof(struct cache_version_header)
+	    + sizeof(struct cache_header) + 20)
+		die("index file smaller than expected");
+
 	git_SHA1_Init(&c);
-	git_SHA1_Update(&c, hdr, size - 20);
+	git_SHA1_Update(&c, mmap, size - 20);
 	git_SHA1_Final(sha1, &c);
-	if (hashcmp(sha1, (unsigned char *)hdr + size - 20))
+	if (hashcmp(sha1, (unsigned char *)mmap + size - 20))
 		return error("bad index file sha1 signature");
 	return 0;
 }
@@ -1424,47 +1434,19 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 	return ce;
 }
 
-/* remember to discard_cache() before reading a different cache! */
-int read_index_from(struct index_state *istate, const char *path)
+static int read_index_v2(struct index_state *istate, void *mmap, unsigned long mmap_size)
 {
-	int fd, i;
-	struct stat st;
+	int i;
 	unsigned long src_offset;
-	struct cache_header *hdr;
-	void *mmap;
-	size_t mmap_size;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v2;
 	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
-	if (istate->initialized)
-		return istate->cache_nr;
-
-	istate->timestamp.sec = 0;
-	istate->timestamp.nsec = 0;
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		if (errno == ENOENT)
-			return 0;
-		die_errno("index file open failed");
-	}
-
-	if (fstat(fd, &st))
-		die_errno("cannot stat the open index");
-
-	mmap_size = xsize_t(st.st_size);
-	if (mmap_size < sizeof(struct cache_header) + 20)
-		die("index file smaller than expected");
-
-	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
-	if (mmap == MAP_FAILED)
-		die_errno("unable to map index file");
-	close(fd);
-
 	hdr = mmap;
-	if (verify_hdr(hdr, mmap_size) < 0)
-		goto unmap;
+	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
 
 	istate->version = ntohl(hdr->hdr_version);
-	istate->cache_nr = ntohl(hdr->hdr_entries);
+	istate->cache_nr = ntohl(hdr_v2->hdr_entries);
 	istate->cache_alloc = alloc_nr(istate->cache_nr);
 	istate->cache = xcalloc(istate->cache_alloc, sizeof(*istate->cache));
 	istate->initialized = 1;
@@ -1474,7 +1456,7 @@ int read_index_from(struct index_state *istate, const char *path)
 	else
 		previous_name = NULL;
 
-	src_offset = sizeof(*hdr);
+	src_offset = sizeof(*hdr) + sizeof(*hdr_v2);
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct ondisk_cache_entry *disk_ce;
 		struct cache_entry *ce;
@@ -1487,8 +1469,6 @@ int read_index_from(struct index_state *istate, const char *path)
 		src_offset += consumed;
 	}
 	strbuf_release(&previous_name_buf);
-	istate->timestamp.sec = st.st_mtime;
-	istate->timestamp.nsec = ST_MTIME_NSEC(st);
 
 	while (src_offset <= mmap_size - 20 - 8) {
 		/* After an array of active_nr index entries,
@@ -1508,6 +1488,58 @@ int read_index_from(struct index_state *istate, const char *path)
 		src_offset += 8;
 		src_offset += extsize;
 	}
+	return 0;
+unmap:
+	munmap(mmap, mmap_size);
+	die("index file corrupt");
+}
+
+/* remember to discard_cache() before reading a different cache! */
+int read_index_from(struct index_state *istate, const char *path)
+{
+	int fd;
+	struct stat st;
+	struct cache_version_header *hdr;
+	void *mmap;
+	size_t mmap_size;
+
+	errno = EBUSY;
+	if (istate->initialized)
+		return istate->cache_nr;
+
+	errno = ENOENT;
+	istate->timestamp.sec = 0;
+	istate->timestamp.nsec = 0;
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		if (errno == ENOENT)
+			return 0;
+		die_errno("index file open failed");
+	}
+
+	if (fstat(fd, &st))
+		die_errno("cannot stat the open index");
+
+	errno = EINVAL;
+	mmap_size = xsize_t(st.st_size);
+	if (mmap_size < sizeof(struct cache_header) + 20)
+		die("index file smaller than expected");
+
+	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+	close(fd);
+	if (mmap == MAP_FAILED)
+		die_errno("unable to map index file");
+
+	hdr = mmap;
+	if (verify_hdr_version(hdr, mmap_size) < 0)
+		goto unmap;
+
+	if (verify_hdr(mmap, mmap_size) < 0)
+		goto unmap;
+
+	read_index_v2(istate, mmap, mmap_size);
+	istate->timestamp.sec = st.st_mtime;
+	istate->timestamp.nsec = ST_MTIME_NSEC(st);
 	munmap(mmap, mmap_size);
 	return istate->cache_nr;
 
@@ -1771,10 +1803,11 @@ void update_index_if_able(struct index_state *istate, struct lock_file *lockfile
 		rollback_lock_file(lockfile);
 }
 
-int write_index(struct index_state *istate, int newfd)
+static int write_index_v2(struct index_state *istate, int newfd)
 {
 	git_SHA_CTX c;
-	struct cache_header hdr;
+	struct cache_version_header hdr;
+	struct cache_header hdr_v2;
 	int i, err, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
 	int entries = istate->cache_nr;
@@ -1804,11 +1837,13 @@ int write_index(struct index_state *istate, int newfd)
 
 	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
 	hdr.hdr_version = htonl(hdr_version);
-	hdr.hdr_entries = htonl(entries - removed);
+	hdr_v2.hdr_entries = htonl(entries - removed);
 
 	git_SHA1_Init(&c);
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
+	if (ce_write(&c, newfd, &hdr_v2, sizeof(hdr_v2)) < 0)
+		return -1;
 
 	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 	for (i = 0; i < entries; i++) {
@@ -1854,6 +1889,11 @@ int write_index(struct index_state *istate, int newfd)
 	return 0;
 }
 
+int write_index(struct index_state *istate, int newfd)
+{
+	return write_index_v2(istate, newfd);
+}
+
 /*
  * Read the index file that is potentially unmerged into given
  * index_state, dropping any unmerged entries.  Returns true if
diff --git a/test-index-version.c b/test-index-version.c
index 05d4699..4c0386f 100644
--- a/test-index-version.c
+++ b/test-index-version.c
@@ -2,7 +2,7 @@
 
 int main(int argc, char **argv)
 {
-	struct cache_header hdr;
+	struct cache_version_header hdr;
 	int version;
 
 	memset(&hdr,0,sizeof(hdr));
-- 
1.8.3.453.g1dfc63d


* [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 01/19] t2104: Don't fail for index versions other than [23] Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 02/19] read-cache: split index file version specific functionality Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  3:10   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 04/19] read-cache: Re-read index if index file changed Thomas Gummerer
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Move the index version 2 specific functions to their own file.  The
version-independent functions stay in read-cache.c, while the index
version 2 specific functions move to read-cache-v2.c.

Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Makefile             |   2 +
 cache.h              |  16 +-
 read-cache-v2.c      | 556 +++++++++++++++++++++++++++++++++++++++++++++++++
 read-cache.c         | 575 ++++-----------------------------------------------
 read-cache.h         |  57 +++++
 test-index-version.c |   5 +
 6 files changed, 661 insertions(+), 550 deletions(-)
 create mode 100644 read-cache-v2.c
 create mode 100644 read-cache.h

diff --git a/Makefile b/Makefile
index 5a68fe5..73369ae 100644
--- a/Makefile
+++ b/Makefile
@@ -711,6 +711,7 @@ LIB_H += progress.h
 LIB_H += prompt.h
 LIB_H += quote.h
 LIB_H += reachable.h
+LIB_H += read-cache.h
 LIB_H += reflog-walk.h
 LIB_H += refs.h
 LIB_H += remote.h
@@ -854,6 +855,7 @@ LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
 LIB_OBJS += read-cache.o
+LIB_OBJS += read-cache-v2.o
 LIB_OBJS += reflog-walk.o
 LIB_OBJS += refs.o
 LIB_OBJS += remote.o
diff --git a/cache.h b/cache.h
index 7af853b..5082b34 100644
--- a/cache.h
+++ b/cache.h
@@ -95,19 +95,8 @@ unsigned long git_deflate_bound(git_zstream *, unsigned long);
  */
 #define DEFAULT_GIT_PORT 9418
 
-/*
- * Basic data structures for the directory cache
- */
 
 #define CACHE_SIGNATURE 0x44495243	/* "DIRC" */
-struct cache_version_header {
-	unsigned int hdr_signature;
-	unsigned int hdr_version;
-};
-
-struct cache_header {
-	unsigned int hdr_entries;
-};
 
 #define INDEX_FORMAT_LB 2
 #define INDEX_FORMAT_UB 4
@@ -280,6 +269,7 @@ struct index_state {
 		 initialized : 1;
 	struct hash_table name_hash;
 	struct hash_table dir_hash;
+	struct index_ops *ops;
 };
 
 extern struct index_state the_index;
@@ -489,8 +479,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig
 #define CE_MATCH_RACY_IS_DIRTY		02
 /* do stat comparison even if CE_SKIP_WORKTREE is true */
 #define CE_MATCH_IGNORE_SKIP_WORKTREE	04
-extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
-extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
+extern int ie_match_stat(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
+extern int ie_modified(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 
 #define PATHSPEC_ONESTAR 1	/* the pathspec pattern sastisfies GFNM_ONESTAR */
 
diff --git a/read-cache-v2.c b/read-cache-v2.c
new file mode 100644
index 0000000..a6883c3
--- /dev/null
+++ b/read-cache-v2.c
@@ -0,0 +1,556 @@
+#include "cache.h"
+#include "read-cache.h"
+#include "resolve-undo.h"
+#include "cache-tree.h"
+#include "varint.h"
+
+/* Mask for the name length in ce_flags in the on-disk index */
+#define CE_NAMEMASK  (0x0fff)
+
+struct cache_header {
+	unsigned int hdr_entries;
+};
+
+/*****************************************************************
+ * Index File I/O
+ *****************************************************************/
+
+/*
+ * dev/ino/uid/gid/size are also just tracked to the low 32 bits
+ * Again - this is just a (very strong in practice) heuristic that
+ * the inode hasn't changed.
+ *
+ * We save the fields in big-endian order to allow using the
+ * index file over NFS transparently.
+ */
+struct ondisk_cache_entry {
+	struct cache_time ctime;
+	struct cache_time mtime;
+	unsigned int dev;
+	unsigned int ino;
+	unsigned int mode;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int size;
+	unsigned char sha1[20];
+	unsigned short flags;
+	char name[FLEX_ARRAY]; /* more */
+};
+
+/*
+ * This struct is used when CE_EXTENDED bit is 1
+ * The struct must match ondisk_cache_entry exactly from
+ * ctime till flags
+ */
+struct ondisk_cache_entry_extended {
+	struct cache_time ctime;
+	struct cache_time mtime;
+	unsigned int dev;
+	unsigned int ino;
+	unsigned int mode;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int size;
+	unsigned char sha1[20];
+	unsigned short flags;
+	unsigned short flags2;
+	char name[FLEX_ARRAY]; /* more */
+};
+
+/* These are only used for v3 or lower */
+#define align_flex_name(STRUCT,len) ((offsetof(struct STRUCT,name) + (len) + 8) & ~7)
+#define ondisk_cache_entry_size(len) align_flex_name(ondisk_cache_entry,len)
+#define ondisk_cache_entry_extended_size(len) align_flex_name(ondisk_cache_entry_extended,len)
+#define ondisk_ce_size(ce) (((ce)->ce_flags & CE_EXTENDED) ? \
+			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
+			    ondisk_cache_entry_size(ce_namelen(ce)))
+
+static int verify_hdr(void *mmap, unsigned long size)
+{
+	git_SHA_CTX c;
+	unsigned char sha1[20];
+
+	if (size < sizeof(struct cache_version_header)
+			+ sizeof(struct cache_header) + 20)
+		die("index file smaller than expected");
+
+	git_SHA1_Init(&c);
+	git_SHA1_Update(&c, mmap, size - 20);
+	git_SHA1_Final(sha1, &c);
+	if (hashcmp(sha1, (unsigned char *)mmap + size - 20))
+		return error("bad index file sha1 signature");
+	return 0;
+}
+
+static int match_stat_basic(const struct cache_entry *ce,
+			    struct stat *st, int changed)
+{
+	changed |= match_stat_data(&ce->ce_stat_data, st);
+
+	/* Racily smudged entry? */
+	if (!ce->ce_stat_data.sd_size) {
+		if (!is_empty_blob_sha1(ce->sha1))
+			changed |= DATA_CHANGED;
+	}
+	return changed;
+}
+
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+						   unsigned int flags,
+						   const char *name,
+						   size_t len)
+{
+	struct cache_entry *ce = xmalloc(cache_entry_size(len));
+
+	ce->ce_stat_data.sd_ctime.sec = ntoh_l(ondisk->ctime.sec);
+	ce->ce_stat_data.sd_mtime.sec = ntoh_l(ondisk->mtime.sec);
+	ce->ce_stat_data.sd_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
+	ce->ce_stat_data.sd_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
+	ce->ce_stat_data.sd_dev   = ntoh_l(ondisk->dev);
+	ce->ce_stat_data.sd_ino   = ntoh_l(ondisk->ino);
+	ce->ce_mode  = ntoh_l(ondisk->mode);
+	ce->ce_stat_data.sd_uid   = ntoh_l(ondisk->uid);
+	ce->ce_stat_data.sd_gid   = ntoh_l(ondisk->gid);
+	ce->ce_stat_data.sd_size  = ntoh_l(ondisk->size);
+	ce->ce_flags = flags & ~CE_NAMEMASK;
+	ce->ce_namelen = len;
+	hashcpy(ce->sha1, ondisk->sha1);
+	memcpy(ce->name, name, len);
+	ce->name[len] = '\0';
+	return ce;
+}
+
+/*
+ * Adjacent cache entries tend to share the leading paths, so it makes
+ * sense to only store the differences in later entries.  In the v4
+ * on-disk format of the index, each on-disk cache entry stores the
+ * number of bytes to be stripped from the end of the previous name,
+ * and the bytes to append to the result, to come up with its name.
+ */
+static unsigned long expand_name_field(struct strbuf *name, const char *cp_)
+{
+	const unsigned char *ep, *cp = (const unsigned char *)cp_;
+	size_t len = decode_varint(&cp);
+
+	if (name->len < len)
+		die("malformed name field in the index");
+	strbuf_remove(name, name->len - len, len);
+	for (ep = cp; *ep; ep++)
+		; /* find the end */
+	strbuf_add(name, cp, ep - cp);
+	return (const char *)ep + 1 - cp_;
+}
+
+static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
+					    unsigned long *ent_size,
+					    struct strbuf *previous_name)
+{
+	struct cache_entry *ce;
+	size_t len;
+	const char *name;
+	unsigned int flags;
+
+	/* On-disk flags are just 16 bits */
+	flags = ntoh_s(ondisk->flags);
+	len = flags & CE_NAMEMASK;
+
+	if (flags & CE_EXTENDED) {
+		struct ondisk_cache_entry_extended *ondisk2;
+		int extended_flags;
+		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
+		extended_flags = ntoh_s(ondisk2->flags2) << 16;
+		/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
+		if (extended_flags & ~CE_EXTENDED_FLAGS)
+			die("Unknown index entry format %08x", extended_flags);
+		flags |= extended_flags;
+		name = ondisk2->name;
+	}
+	else
+		name = ondisk->name;
+
+	if (!previous_name) {
+		/* v3 and earlier */
+		if (len == CE_NAMEMASK)
+			len = strlen(name);
+		ce = cache_entry_from_ondisk(ondisk, flags, name, len);
+
+		*ent_size = ondisk_ce_size(ce);
+	} else {
+		unsigned long consumed;
+		consumed = expand_name_field(previous_name, name);
+		ce = cache_entry_from_ondisk(ondisk, flags,
+					     previous_name->buf,
+					     previous_name->len);
+
+		*ent_size = (name - ((char *)ondisk)) + consumed;
+	}
+	return ce;
+}
+
+static int read_index_extension(struct index_state *istate,
+				const char *ext, void *data, unsigned long sz)
+{
+	switch (CACHE_EXT(ext)) {
+	case CACHE_EXT_TREE:
+		istate->cache_tree = cache_tree_read(data, sz);
+		break;
+	case CACHE_EXT_RESOLVE_UNDO:
+		istate->resolve_undo = resolve_undo_read(data, sz);
+		break;
+	default:
+		if (*ext < 'A' || 'Z' < *ext)
+			return error("index uses %.4s extension, which we do not understand",
+				     ext);
+		fprintf(stderr, "ignoring %.4s extension\n", ext);
+		break;
+	}
+	return 0;
+}
+
+static int read_index_v2(struct index_state *istate, void *mmap,
+			 unsigned long mmap_size)
+{
+	int i;
+	unsigned long src_offset;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v2;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+
+	hdr = mmap;
+	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
+	istate->version = ntohl(hdr->hdr_version);
+	istate->cache_nr = ntohl(hdr_v2->hdr_entries);
+	istate->cache_alloc = alloc_nr(istate->cache_nr);
+	istate->cache = xcalloc(istate->cache_alloc, sizeof(struct cache_entry *));
+	istate->initialized = 1;
+
+	if (istate->version == 4)
+		previous_name = &previous_name_buf;
+	else
+		previous_name = NULL;
+
+	src_offset = sizeof(*hdr) + sizeof(*hdr_v2);
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct ondisk_cache_entry *disk_ce;
+		struct cache_entry *ce;
+		unsigned long consumed;
+
+		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
+		ce = create_from_disk(disk_ce, &consumed, previous_name);
+		set_index_entry(istate, i, ce);
+
+		src_offset += consumed;
+	}
+	strbuf_release(&previous_name_buf);
+
+	while (src_offset <= mmap_size - 20 - 8) {
+		/* After an array of active_nr index entries,
+		 * there can be arbitrary number of extended
+		 * sections, each of which is prefixed with
+		 * extension name (4-byte) and section length
+		 * in 4-byte network byte order.
+		 */
+		uint32_t extsize;
+		memcpy(&extsize, (char *)mmap + src_offset + 4, 4);
+		extsize = ntohl(extsize);
+		if (read_index_extension(istate,
+					(const char *) mmap + src_offset,
+					(char *) mmap + src_offset + 8,
+					extsize) < 0)
+			goto unmap;
+		src_offset += 8;
+		src_offset += extsize;
+	}
+	return 0;
+unmap:
+	munmap(mmap, mmap_size);
+	die("index file corrupt");
+}
+
+#define WRITE_BUFFER_SIZE 8192
+static unsigned char write_buffer[WRITE_BUFFER_SIZE];
+static unsigned long write_buffer_len;
+
+static int ce_write_flush(git_SHA_CTX *context, int fd)
+{
+	unsigned int buffered = write_buffer_len;
+	if (buffered) {
+		git_SHA1_Update(context, write_buffer, buffered);
+		if (write_in_full(fd, write_buffer, buffered) != buffered)
+			return -1;
+		write_buffer_len = 0;
+	}
+	return 0;
+}
+
+static int ce_write(git_SHA_CTX *context, int fd, void *data, unsigned int len)
+{
+	while (len) {
+		unsigned int buffered = write_buffer_len;
+		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
+		if (partial > len)
+			partial = len;
+		memcpy(write_buffer + buffered, data, partial);
+		buffered += partial;
+		if (buffered == WRITE_BUFFER_SIZE) {
+			write_buffer_len = buffered;
+			if (ce_write_flush(context, fd))
+				return -1;
+			buffered = 0;
+		}
+		write_buffer_len = buffered;
+		len -= partial;
+		data = (char *) data + partial;
+	}
+	return 0;
+}
+
+static int write_index_ext_header(git_SHA_CTX *context, int fd,
+				  unsigned int ext, unsigned int sz)
+{
+	ext = htonl(ext);
+	sz = htonl(sz);
+	return ((ce_write(context, fd, &ext, 4) < 0) ||
+		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
+}
+
+static int ce_flush(git_SHA_CTX *context, int fd)
+{
+	unsigned int left = write_buffer_len;
+
+	if (left) {
+		write_buffer_len = 0;
+		git_SHA1_Update(context, write_buffer, left);
+	}
+
+	/* Flush first if not enough space for SHA1 signature */
+	if (left + 20 > WRITE_BUFFER_SIZE) {
+		if (write_in_full(fd, write_buffer, left) != left)
+			return -1;
+		left = 0;
+	}
+
+	/* Append the SHA1 signature at the end */
+	git_SHA1_Final(write_buffer + left, context);
+	left += 20;
+	return (write_in_full(fd, write_buffer, left) != left) ? -1 : 0;
+}
+
+static void ce_smudge_racily_clean_entry(struct index_state *istate, struct cache_entry *ce)
+{
+	/*
+	 * The only thing we care about in this function is to smudge the
+	 * falsely clean entry due to touch-update-touch race, so we leave
+	 * everything else as they are.  We are called for entries whose
+	 * ce_stat_data.sd_mtime match the index file mtime.
+	 *
+	 * Note that this actually does not do much for gitlinks, for
+	 * which ce_match_stat_basic() always goes to the actual
+	 * contents.  The caller checks with is_racy_timestamp() which
+	 * always says "no" for gitlinks, so we are not called for them ;-)
+	 */
+	struct stat st;
+
+	if (lstat(ce->name, &st) < 0)
+		return;
+	if (ce_match_stat_basic(istate, ce, &st))
+		return;
+	if (ce_modified_check_fs(ce, &st)) {
+		/* This is "racily clean"; smudge it.  Note that this
+		 * is a tricky code.  At first glance, it may appear
+		 * that it can break with this sequence:
+		 *
+		 * $ echo xyzzy >frotz
+		 * $ git-update-index --add frotz
+		 * $ : >frotz
+		 * $ sleep 3
+		 * $ echo filfre >nitfol
+		 * $ git-update-index --add nitfol
+		 *
+		 * but it does not.  When the second update-index runs,
+		 * it notices that the entry "frotz" has the same timestamp
+		 * as index, and if we were to smudge it by resetting its
+		 * size to zero here, then the object name recorded
+		 * in index is the 6-byte file but the cached stat information
+		 * becomes zero --- which would then match what we would
+		 * obtain from the filesystem next time we stat("frotz").
+		 *
+		 * However, the second update-index, before calling
+		 * this function, notices that the cached size is 6
+		 * bytes and what is on the filesystem is an empty
+		 * file, and never calls us, so the cached size information
+		 * for "frotz" stays 6 which does not match the filesystem.
+		 */
+		ce->ce_stat_data.sd_size = 0;
+	}
+}
+
+/* Copy miscellaneous fields but not the name */
+static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
+				       struct cache_entry *ce)
+{
+	short flags;
+
+	ondisk->ctime.sec = htonl(ce->ce_stat_data.sd_ctime.sec);
+	ondisk->mtime.sec = htonl(ce->ce_stat_data.sd_mtime.sec);
+	ondisk->ctime.nsec = htonl(ce->ce_stat_data.sd_ctime.nsec);
+	ondisk->mtime.nsec = htonl(ce->ce_stat_data.sd_mtime.nsec);
+	ondisk->dev  = htonl(ce->ce_stat_data.sd_dev);
+	ondisk->ino  = htonl(ce->ce_stat_data.sd_ino);
+	ondisk->mode = htonl(ce->ce_mode);
+	ondisk->uid  = htonl(ce->ce_stat_data.sd_uid);
+	ondisk->gid  = htonl(ce->ce_stat_data.sd_gid);
+	ondisk->size = htonl(ce->ce_stat_data.sd_size);
+	hashcpy(ondisk->sha1, ce->sha1);
+
+	flags = ce->ce_flags;
+	flags |= (ce_namelen(ce) >= CE_NAMEMASK ? CE_NAMEMASK : ce_namelen(ce));
+	ondisk->flags = htons(flags);
+	if (ce->ce_flags & CE_EXTENDED) {
+		struct ondisk_cache_entry_extended *ondisk2;
+		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
+		ondisk2->flags2 = htons((ce->ce_flags & CE_EXTENDED_FLAGS) >> 16);
+		return ondisk2->name;
+	}
+	else {
+		return ondisk->name;
+	}
+}
+
+static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
+			  struct strbuf *previous_name)
+{
+	int size;
+	struct ondisk_cache_entry *ondisk;
+	char *name;
+	int result;
+
+	if (!previous_name) {
+		size = ondisk_ce_size(ce);
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, ce->name, ce_namelen(ce));
+	} else {
+		int common, to_remove, prefix_size;
+		unsigned char to_remove_vi[16];
+		for (common = 0;
+		     (ce->name[common] &&
+		      common < previous_name->len &&
+		      ce->name[common] == previous_name->buf[common]);
+		     common++)
+			; /* still matching */
+		to_remove = previous_name->len - common;
+		prefix_size = encode_varint(to_remove, to_remove_vi);
+
+		if (ce->ce_flags & CE_EXTENDED)
+			size = offsetof(struct ondisk_cache_entry_extended, name);
+		else
+			size = offsetof(struct ondisk_cache_entry, name);
+		size += prefix_size + (ce_namelen(ce) - common + 1);
+
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, to_remove_vi, prefix_size);
+		memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - common);
+
+		strbuf_splice(previous_name, common, to_remove,
+			      ce->name + common, ce_namelen(ce) - common);
+	}
+
+	result = ce_write(c, fd, ondisk, size);
+	free(ondisk);
+	return result;
+}
+
+static int write_index_v2(struct index_state *istate, int newfd)
+{
+	git_SHA_CTX c;
+	struct cache_version_header hdr;
+	struct cache_header hdr_v2;
+	int i, err, removed, extended, hdr_version;
+	struct cache_entry **cache = istate->cache;
+	int entries = istate->cache_nr;
+	struct stat st;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+
+	for (i = removed = extended = 0; i < entries; i++) {
+		if (cache[i]->ce_flags & CE_REMOVE)
+			removed++;
+
+		/* reduce extended entries if possible */
+		cache[i]->ce_flags &= ~CE_EXTENDED;
+		if (cache[i]->ce_flags & CE_EXTENDED_FLAGS) {
+			extended++;
+			cache[i]->ce_flags |= CE_EXTENDED;
+		}
+	}
+
+	if (!istate->version)
+		istate->version = INDEX_FORMAT_DEFAULT;
+
+	/* demote version 3 to version 2 when the latter suffices */
+	if (istate->version == 3 || istate->version == 2)
+		istate->version = extended ? 3 : 2;
+
+	hdr_version = istate->version;
+
+	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
+	hdr.hdr_version = htonl(hdr_version);
+	hdr_v2.hdr_entries = htonl(entries - removed);
+
+	git_SHA1_Init(&c);
+	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
+		return -1;
+	if (ce_write(&c, newfd, &hdr_v2, sizeof(hdr_v2)) < 0)
+		return -1;
+
+	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
+	for (i = 0; i < entries; i++) {
+		struct cache_entry *ce = cache[i];
+		if (ce->ce_flags & CE_REMOVE)
+			continue;
+		if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
+			ce_smudge_racily_clean_entry(istate, ce);
+		if (is_null_sha1(ce->sha1))
+			return error("cache entry has null sha1: %s", ce->name);
+		if (ce_write_entry(&c, newfd, ce, previous_name) < 0)
+			return -1;
+	}
+	strbuf_release(&previous_name_buf);
+
+	/* Write extension data here */
+	if (istate->cache_tree) {
+		struct strbuf sb = STRBUF_INIT;
+
+		cache_tree_write(&sb, istate->cache_tree);
+		err = write_index_ext_header(&c, newfd, CACHE_EXT_TREE, sb.len) < 0
+			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		strbuf_release(&sb);
+		if (err)
+			return -1;
+	}
+	if (istate->resolve_undo) {
+		struct strbuf sb = STRBUF_INIT;
+
+		resolve_undo_write(&sb, istate->resolve_undo);
+		err = write_index_ext_header(&c, newfd, CACHE_EXT_RESOLVE_UNDO,
+					     sb.len) < 0
+			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		strbuf_release(&sb);
+		if (err)
+			return -1;
+	}
+
+	if (ce_flush(&c, newfd) || fstat(newfd, &st))
+		return -1;
+	istate->timestamp.sec = (unsigned int)st.st_mtime;
+	istate->timestamp.nsec = ST_MTIME_NSEC(st);
+	return 0;
+}
+
+struct index_ops v2_ops = {
+	match_stat_basic,
+	verify_hdr,
+	read_index_v2,
+	write_index_v2
+};
diff --git a/read-cache.c b/read-cache.c
index 93947bf..1e7ffc2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -5,6 +5,7 @@
  */
 #define NO_THE_INDEX_COMPATIBILITY_MACROS
 #include "cache.h"
+#include "read-cache.h"
 #include "cache-tree.h"
 #include "refs.h"
 #include "dir.h"
@@ -17,26 +18,9 @@
 
 static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int really);
 
-/* Mask for the name length in ce_flags in the on-disk index */
-
-#define CE_NAMEMASK  (0x0fff)
-
-/* Index extensions.
- *
- * The first letter should be 'A'..'Z' for extensions that are not
- * necessary for a correct operation (i.e. optimization data).
- * When new extensions are added that _needs_ to be understood in
- * order to correctly interpret the index file, pick character that
- * is outside the range, to cause the reader to abort.
- */
-
-#define CACHE_EXT(s) ( (s[0]<<24)|(s[1]<<16)|(s[2]<<8)|(s[3]) )
-#define CACHE_EXT_TREE 0x54524545	/* "TREE" */
-#define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
-
 struct index_state the_index;
 
-static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
@@ -190,7 +174,7 @@ static int ce_compare_gitlink(const struct cache_entry *ce)
 	return hashcmp(sha1, ce->sha1);
 }
 
-static int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st)
+int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st)
 {
 	switch (st->st_mode & S_IFMT) {
 	case S_IFREG:
@@ -210,7 +194,21 @@ static int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st)
 	return 0;
 }
 
-static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
+/*
+ * Choose the index reading/writing operations matching the index
+ * version, if they have not been set yet
+ */
+static void set_istate_ops(struct index_state *istate)
+{
+	if (!istate->version)
+		istate->version = INDEX_FORMAT_DEFAULT;
+
+	if (istate->version >= 2 && istate->version <= 4)
+		istate->ops = &v2_ops;
+}
+
+int ce_match_stat_basic(struct index_state *istate,
+			const struct cache_entry *ce, struct stat *st)
 {
 	unsigned int changed = 0;
 
@@ -243,19 +241,14 @@ static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
 		die("internal error: ce_mode is %o", ce->ce_mode);
 	}
 
-	changed |= match_stat_data(&ce->ce_stat_data, st);
-
-	/* Racily smudged entry? */
-	if (!ce->ce_stat_data.sd_size) {
-		if (!is_empty_blob_sha1(ce->sha1))
-			changed |= DATA_CHANGED;
-	}
-
+	set_istate_ops(istate);
+	changed = istate->ops->match_stat_basic(ce, st, changed);
 	return changed;
 }
 
-static int is_racy_timestamp(const struct index_state *istate,
-			     const struct cache_entry *ce)
+
+int is_racy_timestamp(const struct index_state *istate,
+		      const struct cache_entry *ce)
 {
 	return (!S_ISGITLINK(ce->ce_mode) &&
 		istate->timestamp.sec &&
@@ -270,9 +263,8 @@ static int is_racy_timestamp(const struct index_state *istate,
 		 );
 }
 
-int ie_match_stat(const struct index_state *istate,
-		  const struct cache_entry *ce, struct stat *st,
-		  unsigned int options)
+int ie_match_stat(struct index_state *istate, const struct cache_entry *ce,
+		  struct stat *st, unsigned int options)
 {
 	unsigned int changed;
 	int ignore_valid = options & CE_MATCH_IGNORE_VALID;
@@ -298,7 +290,7 @@ int ie_match_stat(const struct index_state *istate,
 	if (ce->ce_flags & CE_INTENT_TO_ADD)
 		return DATA_CHANGED | TYPE_CHANGED | MODE_CHANGED;
 
-	changed = ce_match_stat_basic(ce, st);
+	changed = ce_match_stat_basic(istate, ce, st);
 
 	/*
 	 * Within 1 second of this sequence:
@@ -326,8 +318,7 @@ int ie_match_stat(const struct index_state *istate,
 	return changed;
 }
 
-int ie_modified(const struct index_state *istate,
-		const struct cache_entry *ce,
+int ie_modified(struct index_state *istate, const struct cache_entry *ce,
 		struct stat *st, unsigned int options)
 {
 	int changed, changed_fs;
@@ -1211,13 +1202,10 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int reall
 	return refresh_cache_ent(&the_index, ce, really, NULL, NULL);
 }
 
-
 /*****************************************************************
  * Index File I/O
  *****************************************************************/
 
-#define INDEX_FORMAT_DEFAULT 3
-
 /*
  * dev/ino/uid/gid/size are also just tracked to the low 32 bits
  * Again - this is just a (very strong in practice) heuristic that
@@ -1268,7 +1256,8 @@ struct ondisk_cache_entry_extended {
 			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
 			    ondisk_cache_entry_size(ce_namelen(ce)))
 
-static int verify_hdr_version(struct cache_version_header *hdr, unsigned long size)
+static int verify_hdr_version(struct index_state *istate,
+			      struct cache_version_header *hdr, unsigned long size)
 {
 	int hdr_version;
 
@@ -1277,43 +1266,7 @@ static int verify_hdr_version(struct cache_version_header *hdr, unsigned long si
 	hdr_version = ntohl(hdr->hdr_version);
 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
 		return error("bad index version %d", hdr_version);
-	return 0;
-}
-
-static int verify_hdr(void *mmap, unsigned long size)
-{
-	git_SHA_CTX c;
-	unsigned char sha1[20];
-
-	if (size < sizeof(struct cache_version_header)
-	    + sizeof(struct cache_header) + 20)
-		die("index file smaller than expected");
-
-	git_SHA1_Init(&c);
-	git_SHA1_Update(&c, mmap, size - 20);
-	git_SHA1_Final(sha1, &c);
-	if (hashcmp(sha1, (unsigned char *)mmap + size - 20))
-		return error("bad index file sha1 signature");
-	return 0;
-}
-
-static int read_index_extension(struct index_state *istate,
-				const char *ext, void *data, unsigned long sz)
-{
-	switch (CACHE_EXT(ext)) {
-	case CACHE_EXT_TREE:
-		istate->cache_tree = cache_tree_read(data, sz);
-		break;
-	case CACHE_EXT_RESOLVE_UNDO:
-		istate->resolve_undo = resolve_undo_read(data, sz);
-		break;
-	default:
-		if (*ext < 'A' || 'Z' < *ext)
-			return error("index uses %.4s extension, which we do not understand",
-				     ext);
-		fprintf(stderr, "ignoring %.4s extension\n", ext);
-		break;
-	}
+	istate->ops = &v2_ops;
 	return 0;
 }
 
@@ -1322,178 +1275,6 @@ int read_index(struct index_state *istate)
 	return read_index_from(istate, get_index_file());
 }
 
-#ifndef NEEDS_ALIGNED_ACCESS
-#define ntoh_s(var) ntohs(var)
-#define ntoh_l(var) ntohl(var)
-#else
-static inline uint16_t ntoh_s_force_align(void *p)
-{
-	uint16_t x;
-	memcpy(&x, p, sizeof(x));
-	return ntohs(x);
-}
-static inline uint32_t ntoh_l_force_align(void *p)
-{
-	uint32_t x;
-	memcpy(&x, p, sizeof(x));
-	return ntohl(x);
-}
-#define ntoh_s(var) ntoh_s_force_align(&(var))
-#define ntoh_l(var) ntoh_l_force_align(&(var))
-#endif
-
-static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
-						   unsigned int flags,
-						   const char *name,
-						   size_t len)
-{
-	struct cache_entry *ce = xmalloc(cache_entry_size(len));
-
-	ce->ce_stat_data.sd_ctime.sec = ntoh_l(ondisk->ctime.sec);
-	ce->ce_stat_data.sd_mtime.sec = ntoh_l(ondisk->mtime.sec);
-	ce->ce_stat_data.sd_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
-	ce->ce_stat_data.sd_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
-	ce->ce_stat_data.sd_dev   = ntoh_l(ondisk->dev);
-	ce->ce_stat_data.sd_ino   = ntoh_l(ondisk->ino);
-	ce->ce_mode  = ntoh_l(ondisk->mode);
-	ce->ce_stat_data.sd_uid   = ntoh_l(ondisk->uid);
-	ce->ce_stat_data.sd_gid   = ntoh_l(ondisk->gid);
-	ce->ce_stat_data.sd_size  = ntoh_l(ondisk->size);
-	ce->ce_flags = flags & ~CE_NAMEMASK;
-	ce->ce_namelen = len;
-	hashcpy(ce->sha1, ondisk->sha1);
-	memcpy(ce->name, name, len);
-	ce->name[len] = '\0';
-	return ce;
-}
-
-/*
- * Adjacent cache entries tend to share the leading paths, so it makes
- * sense to only store the differences in later entries.  In the v4
- * on-disk format of the index, each on-disk cache entry stores the
- * number of bytes to be stripped from the end of the previous name,
- * and the bytes to append to the result, to come up with its name.
- */
-static unsigned long expand_name_field(struct strbuf *name, const char *cp_)
-{
-	const unsigned char *ep, *cp = (const unsigned char *)cp_;
-	size_t len = decode_varint(&cp);
-
-	if (name->len < len)
-		die("malformed name field in the index");
-	strbuf_remove(name, name->len - len, len);
-	for (ep = cp; *ep; ep++)
-		; /* find the end */
-	strbuf_add(name, cp, ep - cp);
-	return (const char *)ep + 1 - cp_;
-}
-
-static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
-					    unsigned long *ent_size,
-					    struct strbuf *previous_name)
-{
-	struct cache_entry *ce;
-	size_t len;
-	const char *name;
-	unsigned int flags;
-
-	/* On-disk flags are just 16 bits */
-	flags = ntoh_s(ondisk->flags);
-	len = flags & CE_NAMEMASK;
-
-	if (flags & CE_EXTENDED) {
-		struct ondisk_cache_entry_extended *ondisk2;
-		int extended_flags;
-		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
-		extended_flags = ntoh_s(ondisk2->flags2) << 16;
-		/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
-		if (extended_flags & ~CE_EXTENDED_FLAGS)
-			die("Unknown index entry format %08x", extended_flags);
-		flags |= extended_flags;
-		name = ondisk2->name;
-	}
-	else
-		name = ondisk->name;
-
-	if (!previous_name) {
-		/* v3 and earlier */
-		if (len == CE_NAMEMASK)
-			len = strlen(name);
-		ce = cache_entry_from_ondisk(ondisk, flags, name, len);
-
-		*ent_size = ondisk_ce_size(ce);
-	} else {
-		unsigned long consumed;
-		consumed = expand_name_field(previous_name, name);
-		ce = cache_entry_from_ondisk(ondisk, flags,
-					     previous_name->buf,
-					     previous_name->len);
-
-		*ent_size = (name - ((char *)ondisk)) + consumed;
-	}
-	return ce;
-}
-
-static int read_index_v2(struct index_state *istate, void *mmap, unsigned long mmap_size)
-{
-	int i;
-	unsigned long src_offset;
-	struct cache_version_header *hdr;
-	struct cache_header *hdr_v2;
-	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
-
-	hdr = mmap;
-	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
-
-	istate->version = ntohl(hdr->hdr_version);
-	istate->cache_nr = ntohl(hdr_v2->hdr_entries);
-	istate->cache_alloc = alloc_nr(istate->cache_nr);
-	istate->cache = xcalloc(istate->cache_alloc, sizeof(*istate->cache));
-	istate->initialized = 1;
-
-	if (istate->version == 4)
-		previous_name = &previous_name_buf;
-	else
-		previous_name = NULL;
-
-	src_offset = sizeof(*hdr) + sizeof(*hdr_v2);
-	for (i = 0; i < istate->cache_nr; i++) {
-		struct ondisk_cache_entry *disk_ce;
-		struct cache_entry *ce;
-		unsigned long consumed;
-
-		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
-		ce = create_from_disk(disk_ce, &consumed, previous_name);
-		set_index_entry(istate, i, ce);
-
-		src_offset += consumed;
-	}
-	strbuf_release(&previous_name_buf);
-
-	while (src_offset <= mmap_size - 20 - 8) {
-		/* After an array of active_nr index entries,
-		 * there can be arbitrary number of extended
-		 * sections, each of which is prefixed with
-		 * extension name (4-byte) and section length
-		 * in 4-byte network byte order.
-		 */
-		uint32_t extsize;
-		memcpy(&extsize, (char *)mmap + src_offset + 4, 4);
-		extsize = ntohl(extsize);
-		if (read_index_extension(istate,
-					 (const char *) mmap + src_offset,
-					 (char *) mmap + src_offset + 8,
-					 extsize) < 0)
-			goto unmap;
-		src_offset += 8;
-		src_offset += extsize;
-	}
-	return 0;
-unmap:
-	munmap(mmap, mmap_size);
-	die("index file corrupt");
-}
-
 /* remember to discard_cache() before reading a different cache! */
 int read_index_from(struct index_state *istate, const char *path)
 {
@@ -1510,6 +1291,7 @@ int read_index_from(struct index_state *istate, const char *path)
 	errno = ENOENT;
 	istate->timestamp.sec = 0;
 	istate->timestamp.nsec = 0;
+
 	fd = open(path, O_RDONLY);
 	if (fd < 0) {
 		if (errno == ENOENT)
@@ -1522,24 +1304,23 @@ int read_index_from(struct index_state *istate, const char *path)
 
 	errno = EINVAL;
 	mmap_size = xsize_t(st.st_size);
-	if (mmap_size < sizeof(struct cache_header) + 20)
-		die("index file smaller than expected");
-
 	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (mmap == MAP_FAILED)
 		die_errno("unable to map index file");
 
 	hdr = mmap;
-	if (verify_hdr_version(hdr, mmap_size) < 0)
+	if (verify_hdr_version(istate, hdr, mmap_size) < 0)
 		goto unmap;
 
-	if (verify_hdr(mmap, mmap_size) < 0)
+	if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
 		goto unmap;
 
-	read_index_v2(istate, mmap, mmap_size);
+	if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
+		goto unmap;
 	istate->timestamp.sec = st.st_mtime;
 	istate->timestamp.nsec = ST_MTIME_NSEC(st);
+
 	munmap(mmap, mmap_size);
 	return istate->cache_nr;
 
@@ -1583,201 +1364,6 @@ int unmerged_index(const struct index_state *istate)
 	return 0;
 }
 
-#define WRITE_BUFFER_SIZE 8192
-static unsigned char write_buffer[WRITE_BUFFER_SIZE];
-static unsigned long write_buffer_len;
-
-static int ce_write_flush(git_SHA_CTX *context, int fd)
-{
-	unsigned int buffered = write_buffer_len;
-	if (buffered) {
-		git_SHA1_Update(context, write_buffer, buffered);
-		if (write_in_full(fd, write_buffer, buffered) != buffered)
-			return -1;
-		write_buffer_len = 0;
-	}
-	return 0;
-}
-
-static int ce_write(git_SHA_CTX *context, int fd, void *data, unsigned int len)
-{
-	while (len) {
-		unsigned int buffered = write_buffer_len;
-		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
-		if (partial > len)
-			partial = len;
-		memcpy(write_buffer + buffered, data, partial);
-		buffered += partial;
-		if (buffered == WRITE_BUFFER_SIZE) {
-			write_buffer_len = buffered;
-			if (ce_write_flush(context, fd))
-				return -1;
-			buffered = 0;
-		}
-		write_buffer_len = buffered;
-		len -= partial;
-		data = (char *) data + partial;
-	}
-	return 0;
-}
-
-static int write_index_ext_header(git_SHA_CTX *context, int fd,
-				  unsigned int ext, unsigned int sz)
-{
-	ext = htonl(ext);
-	sz = htonl(sz);
-	return ((ce_write(context, fd, &ext, 4) < 0) ||
-		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
-}
-
-static int ce_flush(git_SHA_CTX *context, int fd)
-{
-	unsigned int left = write_buffer_len;
-
-	if (left) {
-		write_buffer_len = 0;
-		git_SHA1_Update(context, write_buffer, left);
-	}
-
-	/* Flush first if not enough space for SHA1 signature */
-	if (left + 20 > WRITE_BUFFER_SIZE) {
-		if (write_in_full(fd, write_buffer, left) != left)
-			return -1;
-		left = 0;
-	}
-
-	/* Append the SHA1 signature at the end */
-	git_SHA1_Final(write_buffer + left, context);
-	left += 20;
-	return (write_in_full(fd, write_buffer, left) != left) ? -1 : 0;
-}
-
-static void ce_smudge_racily_clean_entry(struct cache_entry *ce)
-{
-	/*
-	 * The only thing we care about in this function is to smudge the
-	 * falsely clean entry due to touch-update-touch race, so we leave
-	 * everything else as they are.  We are called for entries whose
-	 * ce_stat_data.sd_mtime match the index file mtime.
-	 *
-	 * Note that this actually does not do much for gitlinks, for
-	 * which ce_match_stat_basic() always goes to the actual
-	 * contents.  The caller checks with is_racy_timestamp() which
-	 * always says "no" for gitlinks, so we are not called for them ;-)
-	 */
-	struct stat st;
-
-	if (lstat(ce->name, &st) < 0)
-		return;
-	if (ce_match_stat_basic(ce, &st))
-		return;
-	if (ce_modified_check_fs(ce, &st)) {
-		/* This is "racily clean"; smudge it.  Note that this
-		 * is a tricky code.  At first glance, it may appear
-		 * that it can break with this sequence:
-		 *
-		 * $ echo xyzzy >frotz
-		 * $ git-update-index --add frotz
-		 * $ : >frotz
-		 * $ sleep 3
-		 * $ echo filfre >nitfol
-		 * $ git-update-index --add nitfol
-		 *
-		 * but it does not.  When the second update-index runs,
-		 * it notices that the entry "frotz" has the same timestamp
-		 * as index, and if we were to smudge it by resetting its
-		 * size to zero here, then the object name recorded
-		 * in index is the 6-byte file but the cached stat information
-		 * becomes zero --- which would then match what we would
-		 * obtain from the filesystem next time we stat("frotz").
-		 *
-		 * However, the second update-index, before calling
-		 * this function, notices that the cached size is 6
-		 * bytes and what is on the filesystem is an empty
-		 * file, and never calls us, so the cached size information
-		 * for "frotz" stays 6 which does not match the filesystem.
-		 */
-		ce->ce_stat_data.sd_size = 0;
-	}
-}
-
-/* Copy miscellaneous fields but not the name */
-static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
-				       struct cache_entry *ce)
-{
-	short flags;
-
-	ondisk->ctime.sec = htonl(ce->ce_stat_data.sd_ctime.sec);
-	ondisk->mtime.sec = htonl(ce->ce_stat_data.sd_mtime.sec);
-	ondisk->ctime.nsec = htonl(ce->ce_stat_data.sd_ctime.nsec);
-	ondisk->mtime.nsec = htonl(ce->ce_stat_data.sd_mtime.nsec);
-	ondisk->dev  = htonl(ce->ce_stat_data.sd_dev);
-	ondisk->ino  = htonl(ce->ce_stat_data.sd_ino);
-	ondisk->mode = htonl(ce->ce_mode);
-	ondisk->uid  = htonl(ce->ce_stat_data.sd_uid);
-	ondisk->gid  = htonl(ce->ce_stat_data.sd_gid);
-	ondisk->size = htonl(ce->ce_stat_data.sd_size);
-	hashcpy(ondisk->sha1, ce->sha1);
-
-	flags = ce->ce_flags;
-	flags |= (ce_namelen(ce) >= CE_NAMEMASK ? CE_NAMEMASK : ce_namelen(ce));
-	ondisk->flags = htons(flags);
-	if (ce->ce_flags & CE_EXTENDED) {
-		struct ondisk_cache_entry_extended *ondisk2;
-		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
-		ondisk2->flags2 = htons((ce->ce_flags & CE_EXTENDED_FLAGS) >> 16);
-		return ondisk2->name;
-	}
-	else {
-		return ondisk->name;
-	}
-}
-
-static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
-			  struct strbuf *previous_name)
-{
-	int size;
-	struct ondisk_cache_entry *ondisk;
-	char *name;
-	int result;
-
-	if (!previous_name) {
-		size = ondisk_ce_size(ce);
-		ondisk = xcalloc(1, size);
-		name = copy_cache_entry_to_ondisk(ondisk, ce);
-		memcpy(name, ce->name, ce_namelen(ce));
-	} else {
-		int common, to_remove, prefix_size;
-		unsigned char to_remove_vi[16];
-		for (common = 0;
-		     (ce->name[common] &&
-		      common < previous_name->len &&
-		      ce->name[common] == previous_name->buf[common]);
-		     common++)
-			; /* still matching */
-		to_remove = previous_name->len - common;
-		prefix_size = encode_varint(to_remove, to_remove_vi);
-
-		if (ce->ce_flags & CE_EXTENDED)
-			size = offsetof(struct ondisk_cache_entry_extended, name);
-		else
-			size = offsetof(struct ondisk_cache_entry, name);
-		size += prefix_size + (ce_namelen(ce) - common + 1);
-
-		ondisk = xcalloc(1, size);
-		name = copy_cache_entry_to_ondisk(ondisk, ce);
-		memcpy(name, to_remove_vi, prefix_size);
-		memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - common);
-
-		strbuf_splice(previous_name, common, to_remove,
-			      ce->name + common, ce_namelen(ce) - common);
-	}
-
-	result = ce_write(c, fd, ondisk, size);
-	free(ondisk);
-	return result;
-}
-
 static int has_racy_timestamp(struct index_state *istate)
 {
 	int entries = istate->cache_nr;
@@ -1803,95 +1389,10 @@ void update_index_if_able(struct index_state *istate, struct lock_file *lockfile
 		rollback_lock_file(lockfile);
 }
 
-static int write_index_v2(struct index_state *istate, int newfd)
-{
-	git_SHA_CTX c;
-	struct cache_version_header hdr;
-	struct cache_header hdr_v2;
-	int i, err, removed, extended, hdr_version;
-	struct cache_entry **cache = istate->cache;
-	int entries = istate->cache_nr;
-	struct stat st;
-	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
-
-	for (i = removed = extended = 0; i < entries; i++) {
-		if (cache[i]->ce_flags & CE_REMOVE)
-			removed++;
-
-		/* reduce extended entries if possible */
-		cache[i]->ce_flags &= ~CE_EXTENDED;
-		if (cache[i]->ce_flags & CE_EXTENDED_FLAGS) {
-			extended++;
-			cache[i]->ce_flags |= CE_EXTENDED;
-		}
-	}
-
-	if (!istate->version)
-		istate->version = INDEX_FORMAT_DEFAULT;
-
-	/* demote version 3 to version 2 when the latter suffices */
-	if (istate->version == 3 || istate->version == 2)
-		istate->version = extended ? 3 : 2;
-
-	hdr_version = istate->version;
-
-	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
-	hdr.hdr_version = htonl(hdr_version);
-	hdr_v2.hdr_entries = htonl(entries - removed);
-
-	git_SHA1_Init(&c);
-	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
-		return -1;
-	if (ce_write(&c, newfd, &hdr_v2, sizeof(hdr_v2)) < 0)
-		return -1;
-
-	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
-	for (i = 0; i < entries; i++) {
-		struct cache_entry *ce = cache[i];
-		if (ce->ce_flags & CE_REMOVE)
-			continue;
-		if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
-			ce_smudge_racily_clean_entry(ce);
-		if (is_null_sha1(ce->sha1))
-			return error("cache entry has null sha1: %s", ce->name);
-		if (ce_write_entry(&c, newfd, ce, previous_name) < 0)
-			return -1;
-	}
-	strbuf_release(&previous_name_buf);
-
-	/* Write extension data here */
-	if (istate->cache_tree) {
-		struct strbuf sb = STRBUF_INIT;
-
-		cache_tree_write(&sb, istate->cache_tree);
-		err = write_index_ext_header(&c, newfd, CACHE_EXT_TREE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
-		strbuf_release(&sb);
-		if (err)
-			return -1;
-	}
-	if (istate->resolve_undo) {
-		struct strbuf sb = STRBUF_INIT;
-
-		resolve_undo_write(&sb, istate->resolve_undo);
-		err = write_index_ext_header(&c, newfd, CACHE_EXT_RESOLVE_UNDO,
-					     sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
-		strbuf_release(&sb);
-		if (err)
-			return -1;
-	}
-
-	if (ce_flush(&c, newfd) || fstat(newfd, &st))
-		return -1;
-	istate->timestamp.sec = (unsigned int)st.st_mtime;
-	istate->timestamp.nsec = ST_MTIME_NSEC(st);
-	return 0;
-}
-
 int write_index(struct index_state *istate, int newfd)
 {
-	return write_index_v2(istate, newfd);
+	set_istate_ops(istate);
+	return istate->ops->write_index(istate, newfd);
 }
 
 /*
diff --git a/read-cache.h b/read-cache.h
new file mode 100644
index 0000000..f31a38b
--- /dev/null
+++ b/read-cache.h
@@ -0,0 +1,57 @@
+/* Index extensions.
+ *
+ * The first letter should be 'A'..'Z' for extensions that are not
+ * necessary for a correct operation (i.e. optimization data).
+ * When new extensions are added that _needs_ to be understood in
+ * order to correctly interpret the index file, pick character that
+ * is outside the range, to cause the reader to abort.
+ */
+
+#define CACHE_EXT(s) ( (s[0]<<24)|(s[1]<<16)|(s[2]<<8)|(s[3]) )
+#define CACHE_EXT_TREE 0x54524545	/* "TREE" */
+#define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
+
+#define INDEX_FORMAT_DEFAULT 3
+
+/*
+ * Basic data structures for the directory cache
+ */
+struct cache_version_header {
+	unsigned int hdr_signature;
+	unsigned int hdr_version;
+};
+
+struct index_ops {
+	int (*match_stat_basic)(const struct cache_entry *ce, struct stat *st, int changed);
+	int (*verify_hdr)(void *mmap, unsigned long size);
+	int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size);
+	int (*write_index)(struct index_state *istate, int newfd);
+};
+
+extern struct index_ops v2_ops;
+
+#ifndef NEEDS_ALIGNED_ACCESS
+#define ntoh_s(var) ntohs(var)
+#define ntoh_l(var) ntohl(var)
+#else
+static inline uint16_t ntoh_s_force_align(void *p)
+{
+	uint16_t x;
+	memcpy(&x, p, sizeof(x));
+	return ntohs(x);
+}
+static inline uint32_t ntoh_l_force_align(void *p)
+{
+	uint32_t x;
+	memcpy(&x, p, sizeof(x));
+	return ntohl(x);
+}
+#define ntoh_s(var) ntoh_s_force_align(&(var))
+#define ntoh_l(var) ntoh_l_force_align(&(var))
+#endif
+
+extern int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st);
+extern int ce_match_stat_basic(struct index_state *istate, const struct cache_entry *ce,
+			       struct stat *st);
+extern int is_racy_timestamp(const struct index_state *istate, const struct cache_entry *ce);
+extern void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce);
diff --git a/test-index-version.c b/test-index-version.c
index 4c0386f..65545a7 100644
--- a/test-index-version.c
+++ b/test-index-version.c
@@ -1,5 +1,10 @@
 #include "cache.h"
 
+struct cache_version_header {
+	unsigned int hdr_signature;
+	unsigned int hdr_version;
+};
+
 int main(int argc, char **argv)
 {
 	struct cache_version_header hdr;
-- 
1.8.3.453.g1dfc63d


* [PATCH v2 04/19] read-cache: Re-read index if index file changed
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (2 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 05/19] Add documentation for the index api Thomas Gummerer
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Add the possibility of re-reading the index file if it changed
while it was being read.

The index file might change during the read, causing outdated
information to be displayed.  We check whether the index file has
changed by using its stat data as a heuristic.

Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache.c | 91 +++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 34 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 1e7ffc2..3e3a0e2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1275,11 +1275,31 @@ int read_index(struct index_state *istate)
 	return read_index_from(istate, get_index_file());
 }
 
+static int index_changed(struct stat *st_old, struct stat *st_new)
+{
+	if (st_old->st_mtime != st_new->st_mtime ||
+#if !defined (__CYGWIN__)
+	    st_old->st_uid   != st_new->st_uid ||
+	    st_old->st_gid   != st_new->st_gid ||
+	    st_old->st_ino   != st_new->st_ino ||
+#endif
+#if USE_NSEC
+	    ST_MTIME_NSEC(*st_old) != ST_MTIME_NSEC(*st_new) ||
+#endif
+#if USE_STDEV
+	    st_old->st_dev != st_new->st_dev ||
+#endif
+	    st_old->st_size != st_new->st_size)
+		return 1;
+
+	return 0;
+}
+
 /* remember to discard_cache() before reading a different cache! */
 int read_index_from(struct index_state *istate, const char *path)
 {
-	int fd;
-	struct stat st;
+	int fd, err, i;
+	struct stat st_old, st_new;
 	struct cache_version_header *hdr;
 	void *mmap;
 	size_t mmap_size;
@@ -1291,41 +1311,44 @@ int read_index_from(struct index_state *istate, const char *path)
 	errno = ENOENT;
 	istate->timestamp.sec = 0;
 	istate->timestamp.nsec = 0;
+	for (i = 0; i < 50; i++) {
+		err = 0;
+		fd = open(path, O_RDONLY);
+		if (fd < 0) {
+			if (errno == ENOENT)
+				return 0;
+			die_errno("index file open failed");
+		}
 
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		if (errno == ENOENT)
-			return 0;
-		die_errno("index file open failed");
+		if (fstat(fd, &st_old))
+			die_errno("cannot stat the open index");
+
+		errno = EINVAL;
+		mmap_size = xsize_t(st_old.st_size);
+		mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+		close(fd);
+		if (mmap == MAP_FAILED)
+			die_errno("unable to map index file");
+
+		hdr = mmap;
+		if (verify_hdr_version(istate, hdr, mmap_size) < 0)
+			err = 1;
+
+		if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
+			err = 1;
+
+		if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
+			err = 1;
+		istate->timestamp.sec = st_old.st_mtime;
+		istate->timestamp.nsec = ST_MTIME_NSEC(st_old);
+		if (lstat(path, &st_new))
+			die_errno("cannot stat the open index");
+
+		munmap(mmap, mmap_size);
+		if (!index_changed(&st_old, &st_new) && !err)
+			return istate->cache_nr;
 	}
 
-	if (fstat(fd, &st))
-		die_errno("cannot stat the open index");
-
-	errno = EINVAL;
-	mmap_size = xsize_t(st.st_size);
-	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
-	close(fd);
-	if (mmap == MAP_FAILED)
-		die_errno("unable to map index file");
-
-	hdr = mmap;
-	if (verify_hdr_version(istate, hdr, mmap_size) < 0)
-		goto unmap;
-
-	if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
-		goto unmap;
-
-	if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
-		goto unmap;
-	istate->timestamp.sec = st.st_mtime;
-	istate->timestamp.nsec = ST_MTIME_NSEC(st);
-
-	munmap(mmap, mmap_size);
-	return istate->cache_nr;
-
-unmap:
-	munmap(mmap, mmap_size);
 	die("index file corrupt");
 }
 
-- 
1.8.3.453.g1dfc63d
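The loop in this hunk follows a common pattern: stat the file, read it, stat it again, and retry (here up to 50 times) if the file was replaced in between. A simplified, self-contained sketch of that pattern, using plain stdio and stat() instead of git's xmmap/index_changed helpers; the function names here are illustrative, not git's:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Return 1 if the two stat results indicate the file changed in
 * between; a simplified stand-in for git's index_changed(). */
static int file_changed(const struct stat *a, const struct stat *b)
{
	return a->st_mtime != b->st_mtime ||
	       a->st_size != b->st_size ||
	       a->st_ino != b->st_ino;
}

/* Read the whole file at path into a malloc'd NUL-terminated buffer,
 * retrying if the file is replaced while we read it, as the patch
 * does.  Returns NULL on failure. */
static char *read_file_stable(const char *path)
{
	int i;
	for (i = 0; i < 50; i++) {
		struct stat st_old, st_new;
		FILE *fp = fopen(path, "r");
		char *buf;
		size_t n;

		if (!fp)
			return NULL;
		if (stat(path, &st_old)) { fclose(fp); return NULL; }
		buf = malloc(st_old.st_size + 1);
		if (!buf) { fclose(fp); return NULL; }
		n = fread(buf, 1, st_old.st_size, fp);
		fclose(fp);
		buf[n] = '\0';
		if (stat(path, &st_new)) { free(buf); return NULL; }
		if (!file_changed(&st_old, &st_new))
			return buf;   /* consistent snapshot */
		free(buf);            /* changed under us; retry */
	}
	return NULL;
}
```

The second lstat() in the patch plays the role of the second stat() here: only a read that sees the same file before and after is accepted.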

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 05/19] Add documentation for the index api
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (3 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 04/19] read-cache: Re-read index if index file changed Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 06/19] read-cache: add index reading api Thomas Gummerer
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Add documentation for the index reading api.  This also includes
documentation for the new api functions introduced in the next patch.
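The index_name_pos convention documented below (pos on an exact match, -1-pos otherwise) is the usual negative-insertion-point encoding of a binary search. A self-contained sketch over a plain sorted string array, a toy stand-in for the real cache_entry array, with an illustrative name rather than git's actual function:

```c
#include <string.h>

/* Toy stand-in for index_name_pos(): binary search over a sorted
 * array of names.  Returns the position on an exact match, and
 * -1-pos otherwise, where pos is the position at which the name
 * would be inserted. */
static int name_pos(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);
		if (!cmp)
			return mid;       /* exact match */
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1 - lo;                   /* lo == insertion position */
}
```

With {"file1", "file2", "path/file1", "zzz"}, name_pos() for "path/file1" yields 2 and for "path" yields -3, matching the worked example in the documentation below.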

Helped-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Documentation/technical/api-in-core-index.txt | 54 +++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/Documentation/technical/api-in-core-index.txt b/Documentation/technical/api-in-core-index.txt
index adbdbf5..9b8c37c 100644
--- a/Documentation/technical/api-in-core-index.txt
+++ b/Documentation/technical/api-in-core-index.txt
@@ -1,14 +1,60 @@
 in-core index API
 =================
 
+Reading API
+-----------
+
+`cache`::
+
+	An array of cache entries.  This is used to access the cache
+	entries directly.  Use `index_name_pos` to search for the
+	index of a specific cache entry.
+
+`read_index_filtered`::
+
+	Read a part of the index, filtered by the pathspec given in
+	the opts.  The function may load more than necessary, so the
+	caller is still responsible for applying the filters.  The
+	filtering is only done for performance reasons, as it's
+	possible to read only part of the index when the on-disk
+	format is index-v5.
+
+	To iterate only over the entries that match the pathspec, use
+	the for_each_index_entry function.
+
+`read_index`::
+
+	Read the whole index file from disk.
+
+`index_name_pos`::
+
+	Find a cache_entry with name in the index.  Returns pos if an
+	entry matches exactly, and -1-pos otherwise, where pos is the
+	position at which the entry would be inserted.
+	e.g.
+	index:
+	file1
+	file2
+	path/file1
+	zzz
+
+	index_name_pos("path/file1", 10) returns 2, while
+	index_name_pos("path", 4) returns -3
+
+`for_each_index_entry`::
+
+	Iterates over all cache_entries in the index, filtered by the
+	filter_opts in the index_state.  For each cache entry fn is
+	executed with cb_data as callback data.  From within fn,
+	`return 0` to continue, or a non-zero value to break the loop.
+
+TODO
+----
 Talk about <read-cache.c> and <cache-tree.c>, things like:
 
-* cache -> the_index macros
-* read_index()
 * write_index()
 * ie_match_stat() and ie_modified(); how they are different and when to
   use which.
-* index_name_pos()
 * remove_index_entry_at()
 * remove_file_from_index()
 * add_file_to_index()
@@ -18,4 +64,4 @@ Talk about <read-cache.c> and <cache-tree.c>, things like:
 * cache_tree_invalidate_path()
 * cache_tree_update()
 
-(JC, Linus)
+(JC, Linus, Thomas Gummerer)
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 06/19] read-cache: add index reading api
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (4 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 05/19] Add documentation for the index api Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  3:21   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 07/19] make sure partially read index is not changed Thomas Gummerer
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Add an api for accessing the index file.  Currently there is only a
very basic api, which allows only a full read of the index and leaves
the filtering to the users of the data.  The new index api gives the
users the possibility to use only part of the index and provides
functions for iterating over and accessing cache entries.

This simplifies future improvements to the in-memory format, as changes
will be concentrated on one file, instead of the whole git source code.
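The iteration side of the api follows the classic callback pattern (a function pointer plus an opaque cb_data, stopping early on a non-zero return), as for_each_index_entry_v2 in this patch does. A self-contained toy sketch of that pattern; the struct and function names are illustrative, not git's:

```c
#include <stddef.h>

/* Toy cache entry and callback type, mirroring the shape of
 * each_cache_entry_fn in the patch. */
struct toy_entry { const char *name; int stage; };
typedef int (*toy_entry_fn)(struct toy_entry *e, void *cb_data);

/* Iterate over entries; skip staged entries unless requested (like
 * the read_staged filter option), and stop early when fn returns
 * non-zero, propagating that value. */
static int toy_for_each_entry(struct toy_entry *entries, int nr,
			      int include_staged,
			      toy_entry_fn fn, void *cb_data)
{
	int i, ret = 0;
	for (i = 0; i < nr; i++) {
		if (!include_staged && entries[i].stage)
			continue;
		if ((ret = fn(&entries[i], cb_data)))
			break;
	}
	return ret;
}

/* Example callback: count the entries seen, via cb_data. */
static int count_entry(struct toy_entry *e, void *cb_data)
{
	(void)e;
	(*(int *)cb_data)++;
	return 0;   /* 0 continues the iteration */
}
```

Passing state through cb_data is what lets later patches in the series (e.g. grep's struct grep_opts) convert their open-coded loops to this api without globals.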

Helped-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h         | 42 +++++++++++++++++++++++++++++++++++++++++-
 read-cache-v2.c | 35 +++++++++++++++++++++++++++++++++--
 read-cache.c    | 44 ++++++++++++++++++++++++++++++++++++++++----
 read-cache.h    |  8 +++++++-
 4 files changed, 121 insertions(+), 8 deletions(-)

diff --git a/cache.h b/cache.h
index 5082b34..d305d21 100644
--- a/cache.h
+++ b/cache.h
@@ -127,7 +127,7 @@ struct cache_entry {
 	unsigned int ce_flags;
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
-	struct cache_entry *next;
+	struct cache_entry *next; /* used by name_hash */
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -258,6 +258,29 @@ static inline unsigned int canon_mode(unsigned int mode)
 
 #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
 
+/*
+ * Options by which the index should be filtered when read partially.
+ *
+ * pathspec: The pathspec which the index entries have to match
+ * seen: Used to return the seen parameter from match_pathspec()
+ * max_prefix_len: The common prefix length of the pathspecs
+ *
+ * read_staged: used to indicate if the conflicted entries (entries
+ *     with a stage) should be included
+ * read_cache_tree: used to indicate if the cache-tree should be read
+ * read_resolve_undo: used to indicate if the resolve undo data should
+ *     be read
+ */
+struct filter_opts {
+	const struct pathspec *pathspec;
+	char *seen;
+	int max_prefix_len;
+
+	int read_staged;
+	int read_cache_tree;
+	int read_resolve_undo;
+};
+
 struct index_state {
 	struct cache_entry **cache;
 	unsigned int version;
@@ -270,6 +293,8 @@ struct index_state {
 	struct hash_table name_hash;
 	struct hash_table dir_hash;
 	struct index_ops *ops;
+	struct internal_ops *internal_ops;
+	struct filter_opts *filter_opts;
 };
 
 extern struct index_state the_index;
@@ -311,6 +336,12 @@ extern void free_name_hash(struct index_state *istate);
 #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
 #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
 #define read_blob_data_from_cache(path, sz) read_blob_data_from_index(&the_index, (path), (sz))
+
+/* index api */
+#define read_cache_filtered(opts) read_index_filtered(&the_index, (opts))
+#define read_cache_filtered_from(path, opts) read_index_filtered_from(&the_index, (path), (opts))
+#define for_each_cache_entry(fn, cb_data) \
+	for_each_index_entry(&the_index, (fn), (cb_data))
 #endif
 
 enum object_type {
@@ -438,6 +469,15 @@ extern int init_db(const char *template_dir, unsigned int flags);
 		} \
 	} while (0)
 
+/* index api */
+extern int read_index_filtered(struct index_state *, struct filter_opts *opts);
+extern int read_index_filtered_from(struct index_state *, const char *path, struct filter_opts *opts);
+
+typedef int each_cache_entry_fn(struct cache_entry *ce, void *);
+extern int for_each_index_entry(struct index_state *istate,
+				each_cache_entry_fn, void *);
+
+
 /* Initialize and use the cache information */
 extern int read_index(struct index_state *);
 extern int read_index_preload(struct index_state *, const char **pathspec);
diff --git a/read-cache-v2.c b/read-cache-v2.c
index a6883c3..51b618f 100644
--- a/read-cache-v2.c
+++ b/read-cache-v2.c
@@ -3,6 +3,7 @@
 #include "resolve-undo.h"
 #include "cache-tree.h"
 #include "varint.h"
+#include "dir.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 #define CE_NAMEMASK  (0x0fff)
@@ -207,8 +208,14 @@ static int read_index_extension(struct index_state *istate,
 	return 0;
 }
 
+/*
+ * The performance is the same whether we read the whole index or only
+ * part of it; therefore we always read the whole index to avoid
+ * having to re-read it later.  The filter_opts will determine
+ * what part of the index is used when retrieving the cache-entries.
+ */
 static int read_index_v2(struct index_state *istate, void *mmap,
-			 unsigned long mmap_size)
+			 unsigned long mmap_size, struct filter_opts *opts)
 {
 	int i;
 	unsigned long src_offset;
@@ -238,7 +245,6 @@ static int read_index_v2(struct index_state *istate, void *mmap,
 		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
 		ce = create_from_disk(disk_ce, &consumed, previous_name);
 		set_index_entry(istate, i, ce);
-
 		src_offset += consumed;
 	}
 	strbuf_release(&previous_name_buf);
@@ -548,9 +554,34 @@ static int write_index_v2(struct index_state *istate, int newfd)
 	return 0;
 }
 
+int for_each_index_entry_v2(struct index_state *istate, each_cache_entry_fn fn, void *cb_data)
+{
+	int i, ret = 0;
+	struct filter_opts *opts = istate->filter_opts;
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (opts && !opts->read_staged && ce_stage(ce))
+			continue;
+
+		if (opts && !match_pathspec_depth(opts->pathspec, ce->name, ce_namelen(ce),
+						  opts->max_prefix_len, opts->seen))
+			continue;
+
+		if ((ret = fn(istate->cache[i], cb_data)))
+			break;
+	}
+	return ret;
+}
+
 struct index_ops v2_ops = {
 	match_stat_basic,
 	verify_hdr,
 	read_index_v2,
 	write_index_v2
 };
+
+struct internal_ops v2_internal_ops = {
+	for_each_index_entry_v2,
+};
diff --git a/read-cache.c b/read-cache.c
index 3e3a0e2..9053d43 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -996,6 +996,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
 		memmove(istate->cache + pos + 1,
 			istate->cache + pos,
 			(istate->cache_nr - pos - 1) * sizeof(ce));
+
 	set_index_entry(istate, pos, ce);
 	istate->cache_changed = 1;
 	return 0;
@@ -1272,7 +1273,32 @@ static int verify_hdr_version(struct index_state *istate,
 
 int read_index(struct index_state *istate)
 {
-	return read_index_from(istate, get_index_file());
+	return read_index_filtered_from(istate, get_index_file(), NULL);
+}
+
+int read_index_filtered(struct index_state *istate, struct filter_opts *opts)
+{
+	return read_index_filtered_from(istate, get_index_file(), opts);
+}
+
+int set_internal_ops(struct index_state *istate)
+{
+	if (!istate->internal_ops && istate->cache)
+		istate->internal_ops = &v2_internal_ops;
+	if (!istate->internal_ops)
+		return 0;
+	return 1;
+}
+
+/*
+ * Execute fn for each index entry which is currently in istate.  Data
+ * can be given to the function using the cb_data parameter.
+ */
+int for_each_index_entry(struct index_state *istate, each_cache_entry_fn fn, void *cb_data)
+{
+	if (!set_internal_ops(istate))
+		return 0;
+	return istate->internal_ops->for_each_index_entry(istate, fn, cb_data);
 }
 
 static int index_changed(struct stat *st_old, struct stat *st_new)
@@ -1295,8 +1321,9 @@ static int index_changed(struct stat *st_old, struct stat *st_new)
 	return 0;
 }
 
-/* remember to discard_cache() before reading a different cache! */
-int read_index_from(struct index_state *istate, const char *path)
+
+int read_index_filtered_from(struct index_state *istate, const char *path,
+			     struct filter_opts *opts)
 {
 	int fd, err, i;
 	struct stat st_old, st_new;
@@ -1337,7 +1364,7 @@ int read_index_from(struct index_state *istate, const char *path)
 		if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
 			err = 1;
 
-		if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
+		if (istate->ops->read_index(istate, mmap, mmap_size, opts) < 0)
 			err = 1;
 		istate->timestamp.sec = st_old.st_mtime;
 		istate->timestamp.nsec = ST_MTIME_NSEC(st_old);
@@ -1345,6 +1372,7 @@ int read_index_from(struct index_state *istate, const char *path)
 			die_errno("cannot stat the open index");
 
 		munmap(mmap, mmap_size);
+		istate->filter_opts = opts;
 		if (!index_changed(&st_old, &st_new) && !err)
 			return istate->cache_nr;
 	}
@@ -1352,6 +1380,13 @@ int read_index_from(struct index_state *istate, const char *path)
 	die("index file corrupt");
 }
 
+
+/* remember to discard_cache() before reading a different cache! */
+int read_index_from(struct index_state *istate, const char *path)
+{
+	return read_index_filtered_from(istate, path, NULL);
+}
+
 int is_index_unborn(struct index_state *istate)
 {
 	return (!istate->cache_nr && !istate->timestamp.sec);
@@ -1374,6 +1409,7 @@ int discard_index(struct index_state *istate)
 	free(istate->cache);
 	istate->cache = NULL;
 	istate->cache_alloc = 0;
+	istate->filter_opts = NULL;
 	return 0;
 }
 
diff --git a/read-cache.h b/read-cache.h
index f31a38b..e72a585 100644
--- a/read-cache.h
+++ b/read-cache.h
@@ -24,11 +24,17 @@ struct cache_version_header {
 struct index_ops {
 	int (*match_stat_basic)(const struct cache_entry *ce, struct stat *st, int changed);
 	int (*verify_hdr)(void *mmap, unsigned long size);
-	int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size);
+	int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size,
+			  struct filter_opts *opts);
 	int (*write_index)(struct index_state *istate, int newfd);
 };
 
+struct internal_ops {
+	int (*for_each_index_entry)(struct index_state *istate, each_cache_entry_fn fn, void *cb_data);
+};
+
 extern struct index_ops v2_ops;
+extern struct internal_ops v2_internal_ops;
 
 #ifndef NEEDS_ALIGNED_ACCESS
 #define ntoh_s(var) ntohs(var)
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 07/19] make sure partially read index is not changed
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (5 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 06/19] read-cache: add index reading api Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  3:29   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 08/19] grep.c: Use index api Thomas Gummerer
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

A partially read index file currently cannot be written to disk.  Make
sure that never happens, by erroring out when a caller tries to change
a partially read index.  The caller is responsible for reading the
whole index if it intends to change it later.

Forcing the caller to load the right part of the index file up front,
instead of re-reading it when changing it, gives a bit of a
performance advantage by avoiding reading parts of the index twice.
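The guard the patch adds at each mutation site reuses filter_opts as the "partially read" flag. A minimal sketch of that idea, returning an error code where the real code calls die("BUG: ...") so it can be exercised; the types and names are illustrative:

```c
#include <stddef.h>

/* Toy index state: a non-NULL filter_opts marks a partial read,
 * exactly as the patch uses istate->filter_opts as the flag. */
struct toy_index {
	void *filter_opts;   /* non-NULL => partially read */
	int cache_changed;
};

/* Guarded mutation: refuse to change a partially read index.
 * Returns -1 where the real code would die(). */
static int toy_set_changed(struct toy_index *istate)
{
	if (istate->filter_opts)
		return -1;   /* die("BUG: Can't change a partially read index") */
	istate->cache_changed = 1;
	return 0;
}
```

Since discard_index() resets filter_opts to NULL, a caller that discards and re-reads the full index passes the guard again.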

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/update-index.c |  4 ++++
 cache.h                |  1 +
 read-cache-v2.c        |  2 ++
 read-cache.c           | 10 ++++++++++
 4 files changed, 17 insertions(+)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 5c7762e..4c6e3a6 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -49,6 +49,8 @@ static int mark_ce_flags(const char *path, int flag, int mark)
 	int namelen = strlen(path);
 	int pos = cache_name_pos(path, namelen);
 	if (0 <= pos) {
+		if (active_cache_partially_read)
+			die("BUG: Can't change a partially read index");
 		if (mark)
 			active_cache[pos]->ce_flags |= flag;
 		else
@@ -253,6 +255,8 @@ static void chmod_path(int flip, const char *path)
 	pos = cache_name_pos(path, strlen(path));
 	if (pos < 0)
 		goto fail;
+	if (active_cache_partially_read)
+		die("BUG: Can't change a partially read index");
 	ce = active_cache[pos];
 	mode = ce->ce_mode;
 	if (!S_ISREG(mode))
diff --git a/cache.h b/cache.h
index d305d21..455b772 100644
--- a/cache.h
+++ b/cache.h
@@ -311,6 +311,7 @@ extern void free_name_hash(struct index_state *istate);
 #define active_alloc (the_index.cache_alloc)
 #define active_cache_changed (the_index.cache_changed)
 #define active_cache_tree (the_index.cache_tree)
+#define active_cache_partially_read (the_index.filter_opts)
 
 #define read_cache() read_index(&the_index)
 #define read_cache_from(path) read_index_from(&the_index, (path))
diff --git a/read-cache-v2.c b/read-cache-v2.c
index 51b618f..f3c0685 100644
--- a/read-cache-v2.c
+++ b/read-cache-v2.c
@@ -479,6 +479,8 @@ static int write_index_v2(struct index_state *istate, int newfd)
 	struct stat st;
 	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
+	if (istate->filter_opts)
+		die("BUG: index: cannot write a partially read index");
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/read-cache.c b/read-cache.c
index 9053d43..ab716ed 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -30,6 +30,8 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
 {
 	struct cache_entry *old = istate->cache[nr];
 
+	if (istate->filter_opts)
+		die("BUG: Can't change a partially read index");
 	remove_name_hash(istate, old);
 	set_index_entry(istate, nr, ce);
 	istate->cache_changed = 1;
@@ -467,6 +469,8 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 {
 	struct cache_entry *ce = istate->cache[pos];
 
+	if (istate->filter_opts)
+		die("BUG: Can't change a partially read index");
 	record_resolve_undo(istate, ce);
 	remove_name_hash(istate, ce);
 	istate->cache_changed = 1;
@@ -973,6 +977,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
 {
 	int pos;
 
+	if (istate->filter_opts)
+		die("BUG: Can't change a partially read index");
 	if (option & ADD_CACHE_JUST_APPEND)
 		pos = istate->cache_nr;
 	else {
@@ -1173,6 +1179,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
 				/* If we are doing --really-refresh that
 				 * means the index is not valid anymore.
 				 */
+				if (istate->filter_opts)
+					die("BUG: Can't change a partially read index");
 				ce->ce_flags &= ~CE_VALID;
 				istate->cache_changed = 1;
 			}
@@ -1331,6 +1339,8 @@ int read_index_filtered_from(struct index_state *istate, const char *path,
 	void *mmap;
 	size_t mmap_size;
 
+	if (istate->filter_opts)
+		die("BUG: Can't re-read partially read index");
 	errno = EBUSY;
 	if (istate->initialized)
 		return istate->cache_nr;
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 08/19] grep.c: Use index api
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (6 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 07/19] make sure partially read index is not changed Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  3:32   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 09/19] ls-files.c: use " Thomas Gummerer
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/grep.c | 71 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index a419cda..8b02644 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -368,41 +368,33 @@ static void run_pager(struct grep_opt *opt, const char *prefix)
 	free(argv);
 }
 
-static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int cached)
+struct grep_opts {
+	struct grep_opt *opt;
+	const struct pathspec *pathspec;
+	int cached;
+	int hit;
+};
+
+static int grep_cache(struct cache_entry *ce, void *cb_data)
 {
-	int hit = 0;
-	int nr;
-	read_cache();
+	struct grep_opts *opts = cb_data;
 
-	for (nr = 0; nr < active_nr; nr++) {
-		struct cache_entry *ce = active_cache[nr];
-		if (!S_ISREG(ce->ce_mode))
-			continue;
-		if (!match_pathspec_depth(pathspec, ce->name, ce_namelen(ce), 0, NULL))
-			continue;
-		/*
-		 * If CE_VALID is on, we assume worktree file and its cache entry
-		 * are identical, even if worktree file has been modified, so use
-		 * cache version instead
-		 */
-		if (cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce)) {
-			if (ce_stage(ce))
-				continue;
-			hit |= grep_sha1(opt, ce->sha1, ce->name, 0, ce->name);
-		}
-		else
-			hit |= grep_file(opt, ce->name);
-		if (ce_stage(ce)) {
-			do {
-				nr++;
-			} while (nr < active_nr &&
-				 !strcmp(ce->name, active_cache[nr]->name));
-			nr--; /* compensate for loop control */
-		}
-		if (hit && opt->status_only)
-			break;
-	}
-	return hit;
+	if (!S_ISREG(ce->ce_mode))
+		return 0;
+	if (!match_pathspec_depth(opts->pathspec, ce->name, ce_namelen(ce), 0, NULL))
+		return 0;
+	/*
+	 * If CE_VALID is on, we assume worktree file and its cache entry
+	 * are identical, even if worktree file has been modified, so use
+	 * cache version instead
+	 */
+	if (opts->cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce))
+		opts->hit |= grep_sha1(opts->opt, ce->sha1, ce->name, 0, ce->name);
+	else
+		opts->hit |= grep_file(opts->opt, ce->name);
+	if (opts->hit && opts->opt->status_only)
+		return 1;
+	return 0;
 }
 
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
@@ -895,10 +887,21 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	} else if (0 <= opt_exclude) {
 		die(_("--[no-]exclude-standard cannot be used for tracked contents."));
 	} else if (!list.nr) {
+		struct grep_opts opts;
+		struct filter_opts *filter_opts = xmalloc(sizeof(*filter_opts));
+
 		if (!cached)
 			setup_work_tree();
 
-		hit = grep_cache(&opt, &pathspec, cached);
+		memset(filter_opts, 0, sizeof(*filter_opts));
+		filter_opts->pathspec = &pathspec;
+		opts.opt = &opt;
+		opts.pathspec = &pathspec;
+		opts.cached = cached;
+		opts.hit = 0;
+		read_cache_filtered(filter_opts);
+		for_each_cache_entry(grep_cache, &opts);
+		hit = opts.hit;
 	} else {
 		if (cached)
 			die(_("both --cached and trees are given."));
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 09/19] ls-files.c: use index api
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (7 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 08/19] grep.c: Use index api Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  3:39   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 10/19] documentation: add documentation of the index-v5 file format Thomas Gummerer
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Use the index api to read only part of the index, if the on-disk version
of the index is index-v5.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/ls-files.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 08d9786..80cc398 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -31,6 +31,7 @@ static const char *prefix;
 static int max_prefix_len;
 static int prefix_len;
 static const char **pathspec;
+static struct pathspec pathspec_struct;
 static int error_unmatch;
 static char *ps_matched;
 static const char *with_tree;
@@ -457,6 +458,7 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 	struct dir_struct dir;
 	struct exclude_list *el;
 	struct string_list exclude_list = STRING_LIST_INIT_NODUP;
+	struct filter_opts *opts = xmalloc(sizeof(*opts));
 	struct option builtin_ls_files_options[] = {
 		{ OPTION_CALLBACK, 'z', NULL, NULL, NULL,
 			N_("paths are separated with NUL character"),
@@ -522,9 +524,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		prefix_len = strlen(prefix);
 	git_config(git_default_config, NULL);
 
-	if (read_cache() < 0)
-		die("index file corrupt");
-
 	argc = parse_options(argc, argv, prefix, builtin_ls_files_options,
 			ls_files_usage, 0);
 	el = add_exclude_list(&dir, EXC_CMDL, "--exclude option");
@@ -556,14 +555,7 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		setup_work_tree();
 
 	pathspec = get_pathspec(prefix, argv);
-
-	/* be nice with submodule paths ending in a slash */
-	if (pathspec)
-		strip_trailing_slash_from_submodules();
-
-	/* Find common prefix for all pathspec's */
-	max_prefix = common_prefix(pathspec);
-	max_prefix_len = max_prefix ? strlen(max_prefix) : 0;
+	init_pathspec(&pathspec_struct, pathspec);
 
 	/* Treat unmatching pathspec elements as errors */
 	if (pathspec && error_unmatch) {
@@ -573,6 +565,23 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		ps_matched = xcalloc(1, num);
 	}
 
+	if (!with_tree) {
+		memset(opts, 0, sizeof(*opts));
+		opts->pathspec = &pathspec_struct;
+		opts->read_staged = 1;
+		if (show_resolve_undo)
+			opts->read_resolve_undo = 1;
+		read_cache_filtered(opts);
+	} else {
+		read_cache();
+	}
+	/* be nice with submodule paths ending in a slash */
+	if (pathspec)
+		strip_trailing_slash_from_submodules();
+
+	max_prefix = common_prefix(pathspec);
+	max_prefix_len = max_prefix ? strlen(max_prefix) : 0;
+
 	if ((dir.flags & DIR_SHOW_IGNORED) && !exc_given)
 		die("ls-files --ignored needs some exclude pattern");
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 10/19] documentation: add documentation of the index-v5 file format
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (8 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 09/19] ls-files.c: use " Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  3:59   ` Duy Nguyen
  2013-08-04 11:26   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 11/19] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
                   ` (10 subsequent siblings)
  20 siblings, 2 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Add a documentation of the index file format version 5 to
Documentation/technical.
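A recurring trick in this format is the ndir + 1 (and nfile + 1) offset tables: with one extra trailing offset, the length of each entry's name can be computed from adjacent offsets instead of calling strlen. A tiny sketch of that computation, with an illustrative function name; 61 is the fixed per-entry tail the document gives for directory entries (directory data + crc + NUL):

```c
#include <stdint.h>

/* Length of entry n's name, given an offsets table with one extra
 * trailing offset, per the diroffsets/fileoffsets scheme:
 * name_len = offsets[n+1] - offsets[n] - fixed_tail. */
static uint32_t entry_name_len(const uint32_t *offsets, int n,
			       uint32_t fixed_tail)
{
	return offsets[n + 1] - offsets[n] - fixed_tail;
}
```

The same table is what makes the entries bisectable: a binary search can jump to offsets[mid] without parsing the preceding variable-length names.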

Helped-by: Michael Haggerty <mhagger@alum.mit.edu>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Documentation/technical/index-file-format-v5.txt | 296 +++++++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 Documentation/technical/index-file-format-v5.txt

diff --git a/Documentation/technical/index-file-format-v5.txt b/Documentation/technical/index-file-format-v5.txt
new file mode 100644
index 0000000..4213087
--- /dev/null
+++ b/Documentation/technical/index-file-format-v5.txt
@@ -0,0 +1,296 @@
+GIT index format
+================
+
+== The git index
+
+   The git index file (.git/index) documents the status of the files
+     in the git staging area.
+
+   The staging area is used for preparing commits, merging, etc.
+
+== The git index file format
+
+   All binary numbers are in network byte order. Version 5 is described
+     here. The index file consists of various sections. They appear in
+     the following order in the file.
+
+   - header: the description of the index format, including its signature,
+     version and various other fields that are used internally.
+
+   - diroffsets (ndir entries of "directory offset"): A 4-byte offset
+       relative to the beginning of the "direntries block" (see below)
+       for each of the ndir directories in the index, sorted by pathname
+       (of the directory it's pointing to). [1]
+
+   - direntries (ndir entries of "directory entry"): A directory entry
+       for each of the ndir directories in the index, sorted by pathname
+       (see below). [2]
+
+   - fileoffsets (nfile entries of "file offset"): A 4-byte offset
+       relative to the beginning of the fileentries block (see below)
+       for each of the nfile files in the index. [1]
+
+   - fileentries (nfile entries of "file entry"): A file entry for
+       each of the nfile files in the index (see below).
+
+   - crdata: A number of entries for conflicted data/resolved conflicts
+       (see below).
+
+   - Extensions (Currently none, see below in the future)
+
+     Extensions are identified by signature. Optional extensions can
+     be ignored if GIT does not understand them.
+
+     GIT supports an arbitrary number of extensions, but currently
+     none is implemented. [3]
+
+     extsig (32-bits): extension signature. If the first byte is 'A'..'Z'
+     the extension is optional and can be ignored.
+
+     extsize (32-bits): size of the extension, excluding the header
+       (extsig, extsize, extchecksum).
+
+     extchecksum (32-bits): crc32 checksum of the extension signature
+       and size.
+
+    - Extension data.
+
+== Header
+   sig (32-bits): Signature:
+     The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
+
+   vnr (32-bits): Version number:
+     The current supported versions are 2, 3, 4 and 5.
+
+   ndir (32-bits): number of directories in the index.
+
+   nfile (32-bits): number of file entries in the index.
+
+   fblockoffset (32-bits): offset to the file block, relative to the
+     beginning of the file.
+
+   - Offset to the extensions.
+
+     nextensions (32-bits): number of extensions.
+
+     extoffset (32-bits): offset to the extension. (Possibly none, as
+       many as indicated in the 4-byte number of extensions)
+
+   headercrc (32-bits): crc checksum including the header and the
+     offsets to the extensions.
+
+
+== Directory offsets (diroffsets)
+
+  diroffset (32-bits): offset to the directory relative to the beginning
+    of the index file. There are ndir + 1 offsets in the diroffset table,
+    the last one points to the end of the last direntry. With this last
+    entry, we can avoid the strlen call when reading the directory
+    name, by calculating it from diroffset[n+1]-diroffset[n]-61.  61 is
+    the size of the directory data which follows each directory name,
+    plus the crc sum and the NUL byte.
+
+  This part is needed for making the directory entries bisectable and
+    thus allowing a binary search.
+
+== Directory entry (direntries)
+
+  Directory entries are sorted in lexicographic order by the name
+    of their path starting with the root.
+
+  pathname (variable length, nul terminated): relative to top level
+    directory (without the leading slash). '/' is used as path
+    separator. A string of length 0 ('') indicates the root directory.
+    The special path components ".", and ".." (without quotes) are
+    disallowed. The path also includes a trailing slash. [9]
+
+  foffset (32-bits): offset to the lexicographically first file in
+    the file offsets (fileoffsets), relative to the beginning of
+    the fileoffset block.
+
+  cr (32-bits): offset to conflicted/resolved data at the end of the
+    index. 0 if there is no such data. [4]
+
+  ncr (32-bits): number of conflicted/resolved data entries at the
+    end of the index if the offset is non 0. If cr is 0, ncr is
+    also 0.
+
+  nsubtrees (32-bits): number of subtrees this tree has in the index.
+
+  nfiles (32-bits): number of files in the directory, that are in
+    the index.
+
+  nentries (32-bits): number of index entries that are covered by
+    the tree this entry represents (-1 if the entry is invalid).
+    This number includes all the files in this tree, recursively.
+
+  objname (160-bits): object name for the object that would result
+    from writing this span of index as a tree. This is only valid
+    if nentries is valid, meaning the cache-tree is valid.
+
+  flags (16-bits): 'flags' field split into (high to low bits) (For
+    D/F conflicts)
+
+    stage (2-bits): stage of the directory during merge
+
+    14-bit unused
+
+  dircrc (32-bits): crc32 checksum for each directory entry.
+
+  The last 24 bytes (4-byte number of entries + 160-bit object name) are
+    for the cache tree. An entry can be in an invalidated state, which is
+    represented by -1 in the nentries field.
+
+  The entries are written out in the top-down, depth-first order. The
+    first entry represents the root level of the repository, followed by
+    the first subtree - let's call it A - of the root level, followed by
+    the first subtree of A, ... There is no prefix compression for
+    directories.
+
+== File offsets (fileoffsets)
+
+  fileoffset (32-bits): offset to the file relative to the beginning of
+    the fileentries block.
+
+  This part is needed for making the file entries bisectable and
+    thus allowing a binary search. There are nfiles + 1 offsets in the
+    fileoffset table; the last one points to the end of the last
+    fileentry. With this last entry, we can replace the strlen() call
+    when reading each filename by calculating its length from the
+    offsets.
+
+== File entry (fileentries)
+
+  File entries are sorted in ascending order by the name field, starting
+  at the offset given by their directory entry. All file names are
+  prefix compressed, meaning the file name is stored relative to the
+  directory.
+
+  filename (variable length, nul terminated). The exact encoding is
+    undefined, but the filename cannot contain a NUL byte (iow, the same
+    encoding as a UNIX pathname).
+
+  flags (16-bits): 'flags' field split into (high to low bits)
+
+    assumevalid (1-bit): assume-valid flag
+
+    intenttoadd (1-bit): intent-to-add flag, used by "git add -N".
+      Extended flag in index v3.
+
+    stage (2-bit): stage of the file during merge
+
+    skipworktree (1-bit): skip-worktree flag, used by sparse checkout.
+      Extended flag in index v3.
+
+    smudged (1-bit): indicates if the file is racily smudged.
+
+    10-bit unused, must be zero [6]
+
+  mode (16-bits): file mode, split into (high to low bits)
+
+    objtype (4-bits): object type
+      valid values in binary are 1000 (regular file), 1010 (symbolic
+      link) and 1110 (gitlink)
+
+    3-bit unused
+
+    permission (9-bits): unix permission. Only 0755 and 0644 are valid
+      for regular files. Symbolic links and gitlinks have value 0 in
+      this field.
+
+  mtimes (32-bits): mtime seconds, the last time a file's data changed;
+    this is stat(2) data
+
+  mtimens (32-bits): mtime nanosecond fractions
+    this is stat(2) data
+
+  file size (32-bits): The on-disk size, truncated to 32 bits.
+    this is stat(2) data
+
+  statcrc (32-bits): crc32 checksum over ctime seconds, ctime
+    nanoseconds, ino, dev, uid, gid (All stat(2) data
+    except mtime and file size). If the statcrc is 0 it will
+    be ignored. [7]
+
+  objhash (160-bits): SHA-1 for the represented object
+
+  entrycrc (32-bits): crc32 checksum for the file entry. The crc
+    computation also covers the offset of this entry's slot in the
+    fileoffset table, relative to the beginning of the index file.
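The flags and mode layouts above map to bit masks like the following (a sketch only; the V5_* names and helpers are illustrative stand-ins for the CE_* constants the series defines in cache.h):

```c
#include <assert.h>
#include <stdint.h>

/* Flags field, high to low bits (values match the layout above). */
#define V5_ASSUME_VALID   0x8000
#define V5_INTENT_TO_ADD  0x4000
#define V5_STAGEMASK      0x3000
#define V5_STAGESHIFT     12
#define V5_SKIP_WORKTREE  0x0800
#define V5_SMUDGED        0x0400

/* Object type values from the 4-bit objtype field of the mode. */
#define V5_OBJ_REGULAR 0x8	/* 1000 in binary */
#define V5_OBJ_SYMLINK 0xa	/* 1010 in binary */
#define V5_OBJ_GITLINK 0xe	/* 1110 in binary */

static inline unsigned v5_stage(uint16_t flags)
{
	return (flags & V5_STAGEMASK) >> V5_STAGESHIFT;
}

static inline unsigned v5_objtype(uint16_t mode)
{
	return (mode >> 12) & 0xf;
}

static inline unsigned v5_permissions(uint16_t mode)
{
	return mode & 0777;
}

/* Only 0755 and 0644 are valid permissions for regular files;
 * symbolic links and gitlinks must have 0 in the permission field. */
static inline int v5_mode_valid(uint16_t mode)
{
	switch (v5_objtype(mode)) {
	case V5_OBJ_REGULAR:
		return v5_permissions(mode) == 0755 ||
		       v5_permissions(mode) == 0644;
	case V5_OBJ_SYMLINK:
	case V5_OBJ_GITLINK:
		return v5_permissions(mode) == 0;
	default:
		return 0;
	}
}
```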
+
+== Conflict data
+
+  A conflict is represented in the index as a set of higher stage entries.
+  These entries are stored at the end of the index. When a conflict is
+  resolved (e.g. with "git add path"), a bit is flipped to indicate that
+  the conflict is resolved, but the entries are kept, so that conflicts
+  can be recreated (e.g. with "git checkout -m", in case users want to
+  redo a conflict resolution from scratch).
+
+  The first part of a conflict (usually stage 1) will be stored both in
+  the entries part of the index and in the conflict part. All other parts
+  will only be stored in the conflict part.
+
+  filename (variable length, nul terminated): filename of the entry,
+    relative to its containing directory.
+
+  nfileconflicts (32-bits): number of conflicts for the file [8]
+
+  flags (nfileconflicts entries of "flags") (16-bits): 'flags' field
+    split into:
+
+    conflicted (1-bit): conflicted state (conflicted/resolved) (1 if
+      conflicted)
+
+    stage (2-bits): stage during merge.
+
+    13-bit unused
+
+  entry_mode (nfileconflicts entries of "entry mode") (16-bits):
+    entry mode of each entry in the different stages, as octal numbers.
+    (The count is given by nfileconflicts above.)
+
+  objectnames (nfileconflicts entries of "object name") (160-bits):
+    object names of the different stages.
+
+  conflictcrc (32-bits): crc32 checksum over conflict data.
+
+== Design explanations
+
+[1] The directory and file offsets are included in the index format
+    to enable bisectability of the index, for binary searches. Updating
+    a single entry and partial reading both benefit from this.
+
+[2] The directories are saved in their own block, to be able to
+    quickly search for a directory in the index. They include an
+    offset to the (lexically) first file in the directory.
+
+[3] The data of the cache-tree extension and the resolve undo
+    extension is now part of the index itself, but if other extensions
+    come up in the future, there is no need to change the index, they
+    can simply be added at the end.
+
+[4] To avoid rewrites of the whole index when there are conflicts or
+    conflicts are being resolved, conflicted data will be stored at
+    the end of the index. To mark the conflict resolved, just a bit
+    has to be flipped. The data will still be there, if a user wants
+    to redo the conflict resolution.
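The "flip a bit" operation is just clearing the conflicted flag described in the conflict data section; roughly (a sketch with an illustrative helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Highest bit of the 16-bit flags word stored with the conflict data. */
#define CONFLICT_CONFLICTED 0x8000

/* Marking a conflict resolved only clears this one bit; the stage
 * entries themselves stay on disk, so the resolution can later be
 * recreated.  The helper name is illustrative. */
static inline uint16_t mark_resolved(uint16_t flags)
{
	return flags & ~CONFLICT_CONFLICTED;
}
```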
+
+[5] Since only 4 modes are effectively allowed in git but 32 bits are
+    used to store them, a two-bit flag for the mode is enough and
+    saves 4 bytes per entry.
+
+[6] The length of the file name was dropped, since each file name is
+    nul terminated anyway.
+
+[7] Since all stat data (except mtime and ctime) is just used for
+    checking if a file has changed, a checksum of the data is enough.
+    In addition, Thomas Rast suggested that ctime could be ditched
+    completely (core.trustctime=false) and thus included in the
+    checksum. This would save 24 bytes per index entry, which would
+    be about 4 MB on the WebKit index.
+    (Thanks to Michael Haggerty for the suggestion)
+
+[8] Since there can be more than one stage entry per file, it is
+    necessary to know how many conflict data entries there are.
+
+[9] As Michael Haggerty pointed out on the mailing list, storing the
+    trailing slash will simplify a few operations.
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 11/19] read-cache: make in-memory format aware of stat_crc
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (9 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 10/19] documentation: add documentation of the index-v5 file format Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 12/19] read-cache: read index-v5 Thomas Gummerer
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Make the in-memory format aware of the stat_crc used by index-v5.
It is simply ignored by index versions prior to v5.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h      |  1 +
 read-cache.c | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/cache.h b/cache.h
index 455b772..2097105 100644
--- a/cache.h
+++ b/cache.h
@@ -127,6 +127,7 @@ struct cache_entry {
 	unsigned int ce_flags;
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
+	uint32_t ce_stat_crc;
 	struct cache_entry *next; /* used by name_hash */
 	char name[FLEX_ARRAY]; /* more */
 };
diff --git a/read-cache.c b/read-cache.c
index ab716ed..9bfbb4f 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -108,6 +108,29 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
 	return changed;
 }
 
+static uint32_t calculate_stat_crc(struct cache_entry *ce)
+{
+	unsigned int ctimens = 0;
+	uint32_t stat, stat_crc;
+
+	stat = htonl(ce->ce_stat_data.sd_ctime.sec);
+	stat_crc = crc32(0, (Bytef*)&stat, 4);
+#ifdef USE_NSEC
+	ctimens = ce->ce_stat_data.sd_ctime.nsec;
+#endif
+	stat = htonl(ctimens);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_ino);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_dev);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_uid);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_gid);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	return stat_crc;
+}
+
 /*
  * This only updates the "non-critical" parts of the directory
  * cache, ie the parts that aren't tracked by GIT, and only used
@@ -122,6 +145,8 @@ void fill_stat_cache_info(struct cache_entry *ce, struct stat *st)
 
 	if (S_ISREG(st->st_mode))
 		ce_mark_uptodate(ce);
+
+	ce->ce_stat_crc = calculate_stat_crc(ce);
 }
 
 static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 12/19] read-cache: read index-v5
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (10 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 11/19] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-14  4:42   ` Duy Nguyen
  2013-07-15 10:12   ` Duy Nguyen
  2013-07-12 17:26 ` [PATCH v2 13/19] read-cache: read resolve-undo data Thomas Gummerer
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Make git read the index file version 5 without complaining.

This version of the reader reads neither the cache-tree nor
the resolve undo data, but doesn't choke on an index that
includes such data.

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Makefile        |   1 +
 cache.h         |  75 ++++++-
 read-cache-v5.c | 638 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 read-cache.h    |   1 +
 4 files changed, 714 insertions(+), 1 deletion(-)
 create mode 100644 read-cache-v5.c

diff --git a/Makefile b/Makefile
index 73369ae..80e35f5 100644
--- a/Makefile
+++ b/Makefile
@@ -856,6 +856,7 @@ LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
 LIB_OBJS += read-cache.o
 LIB_OBJS += read-cache-v2.o
+LIB_OBJS += read-cache-v5.o
 LIB_OBJS += reflog-walk.o
 LIB_OBJS += refs.o
 LIB_OBJS += remote.o
diff --git a/cache.h b/cache.h
index 2097105..1e5cc77 100644
--- a/cache.h
+++ b/cache.h
@@ -99,7 +99,7 @@ unsigned long git_deflate_bound(git_zstream *, unsigned long);
 #define CACHE_SIGNATURE 0x44495243	/* "DIRC" */
 
 #define INDEX_FORMAT_LB 2
-#define INDEX_FORMAT_UB 4
+#define INDEX_FORMAT_UB 5
 
 /*
  * The "cache_time" is just the low 32 bits of the
@@ -121,6 +121,15 @@ struct stat_data {
 	unsigned int sd_size;
 };
 
+/*
+ * The *next pointer is used in read_entries_v5 for holding
+ * all the elements of a directory, and points to the next
+ * cache_entry in a directory.
+ *
+ * It is reset by the add_name_hash call in set_index_entry
+ * to set it to point to the next cache_entry in the
+ * correct in-memory format ordering.
+ */
 struct cache_entry {
 	struct stat_data ce_stat_data;
 	unsigned int ce_mode;
@@ -132,11 +141,59 @@ struct cache_entry {
 	char name[FLEX_ARRAY]; /* more */
 };
 
+struct directory_entry {
+	struct directory_entry *next;
+	struct directory_entry *next_hash;
+	struct cache_entry *ce;
+	struct cache_entry *ce_last;
+	struct conflict_entry *conflict;
+	struct conflict_entry *conflict_last;
+	unsigned int conflict_size;
+	unsigned int de_foffset;
+	unsigned int de_cr;
+	unsigned int de_ncr;
+	unsigned int de_nsubtrees;
+	unsigned int de_nfiles;
+	unsigned int de_nentries;
+	unsigned char sha1[20];
+	unsigned short de_flags;
+	unsigned int de_pathlen;
+	char pathname[FLEX_ARRAY];
+};
+
+struct conflict_part {
+	struct conflict_part *next;
+	unsigned short flags;
+	unsigned short entry_mode;
+	unsigned char sha1[20];
+};
+
+struct conflict_entry {
+	struct conflict_entry *next;
+	unsigned int nfileconflicts;
+	struct conflict_part *entries;
+	unsigned int namelen;
+	unsigned int pathlen;
+	char name[FLEX_ARRAY];
+};
+
+struct ondisk_conflict_part {
+	unsigned short flags;
+	unsigned short entry_mode;
+	unsigned char sha1[20];
+};
+
+#define CE_NAMEMASK  (0x0fff)
 #define CE_STAGEMASK (0x3000)
 #define CE_EXTENDED  (0x4000)
 #define CE_VALID     (0x8000)
+#define CE_SMUDGED   (0x0400) /* index v5 only flag */
 #define CE_STAGESHIFT 12
 
+#define CONFLICT_CONFLICTED (0x8000)
+#define CONFLICT_STAGESHIFT 13
+#define CONFLICT_STAGEMASK (0x6000)
+
 /*
  * Range 0xFFFF0000 in ce_flags is divided into
  * two parts: in-memory flags and on-disk ones.
@@ -173,6 +230,18 @@ struct cache_entry {
 #define CE_EXTENDED_FLAGS (CE_INTENT_TO_ADD | CE_SKIP_WORKTREE)
 
 /*
+ * Representation of the extended on-disk flags in the v5 format.
+ * They must not collide with the ordinary on-disk flags, and need to
+ * fit in 16 bits.  Note however that v5 does not save the name
+ * length.
+ */
+#define CE_INTENT_TO_ADD_V5  (0x4000)
+#define CE_SKIP_WORKTREE_V5  (0x0800)
+#if (CE_VALID|CE_STAGEMASK) & (CE_INTENT_TO_ADD_V5|CE_SKIP_WORKTREE_V5)
+#error "v5 on-disk flags collide with ordinary on-disk flags"
+#endif
+
+/*
  * Safeguard to avoid saving wrong flags:
  *  - CE_EXTENDED2 won't get saved until its semantic is known
  *  - Bits in 0x0000FFFF have been saved in ce_flags already
@@ -211,6 +280,8 @@ static inline unsigned create_ce_flags(unsigned stage)
 #define ce_skip_worktree(ce) ((ce)->ce_flags & CE_SKIP_WORKTREE)
 #define ce_mark_uptodate(ce) ((ce)->ce_flags |= CE_UPTODATE)
 
+#define conflict_stage(c) ((CONFLICT_STAGEMASK & (c)->flags) >> CONFLICT_STAGESHIFT)
+
 #define ce_permissions(mode) (((mode) & 0100) ? 0755 : 0644)
 static inline unsigned int create_ce_mode(unsigned int mode)
 {
@@ -258,6 +329,8 @@ static inline unsigned int canon_mode(unsigned int mode)
 }
 
 #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
+#define directory_entry_size(len) (offsetof(struct directory_entry,pathname) + (len) + 1)
+#define conflict_entry_size(len) (offsetof(struct conflict_entry,name) + (len) + 1)
 
 /*
  * Options by which the index should be filtered when read partially.
diff --git a/read-cache-v5.c b/read-cache-v5.c
new file mode 100644
index 0000000..00112ea
--- /dev/null
+++ b/read-cache-v5.c
@@ -0,0 +1,638 @@
+#include "cache.h"
+#include "read-cache.h"
+#include "resolve-undo.h"
+#include "cache-tree.h"
+#include "dir.h"
+
+#define ptr_add(x,y) ((void *)(((char *)(x)) + (y)))
+
+struct cache_header {
+	unsigned int hdr_ndir;
+	unsigned int hdr_nfile;
+	unsigned int hdr_fblockoffset;
+	unsigned int hdr_nextension;
+};
+
+/*****************************************************************
+ * Index File I/O
+ *****************************************************************/
+
+struct ondisk_cache_entry {
+	unsigned short flags;
+	unsigned short mode;
+	struct cache_time mtime;
+	unsigned int size;
+	int stat_crc;
+	unsigned char sha1[20];
+};
+
+struct ondisk_directory_entry {
+	unsigned int foffset;
+	unsigned int cr;
+	unsigned int ncr;
+	unsigned int nsubtrees;
+	unsigned int nfiles;
+	unsigned int nentries;
+	unsigned char sha1[20];
+	unsigned short flags;
+};
+
+static int check_crc32(int initialcrc,
+			void *data,
+			size_t len,
+			unsigned int expected_crc)
+{
+	int crc;
+
+	crc = crc32(initialcrc, (Bytef*)data, len);
+	return crc == expected_crc;
+}
+
+static int match_stat_crc(struct stat *st, uint32_t expected_crc)
+{
+	uint32_t data, stat_crc = 0;
+	unsigned int ctimens = 0;
+
+	data = htonl(st->st_ctime);
+	stat_crc = crc32(0, (Bytef*)&data, 4);
+#ifdef USE_NSEC
+	ctimens = ST_CTIME_NSEC(*st);
+#endif
+	data = htonl(ctimens);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_ino);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_dev);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_uid);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_gid);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+
+	return stat_crc == expected_crc;
+}
+
+static int match_stat_basic(const struct cache_entry *ce,
+			    struct stat *st,
+			    int changed)
+{
+	if (ce->ce_stat_data.sd_mtime.sec != (unsigned int)st->st_mtime)
+		changed |= MTIME_CHANGED;
+#ifdef USE_NSEC
+	if (ce->ce_stat_data.sd_mtime.nsec != ST_MTIME_NSEC(*st))
+		changed |= MTIME_CHANGED;
+#endif
+	if (ce->ce_stat_data.sd_size != (unsigned int)st->st_size)
+		changed |= DATA_CHANGED;
+
+	if (trust_ctime && ce->ce_stat_crc != 0 && !match_stat_crc(st, ce->ce_stat_crc)) {
+		changed |= OWNER_CHANGED;
+		changed |= INODE_CHANGED;
+	}
+	/* Racily smudged entry? */
+	if (ce->ce_flags & CE_SMUDGED) {
+		if (!changed && !is_empty_blob_sha1(ce->sha1) && ce_modified_check_fs(ce, st))
+			changed |= DATA_CHANGED;
+	}
+	return changed;
+}
+
+static int verify_hdr(void *mmap, unsigned long size)
+{
+	uint32_t *filecrc;
+	unsigned int header_size;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v5;
+
+	if (size < sizeof(struct cache_version_header)
+			+ sizeof (struct cache_header) + 4)
+		die("index file smaller than expected");
+
+	hdr = mmap;
+	hdr_v5 = ptr_add(mmap, sizeof(*hdr));
+	/* Size of the header + the size of the extensionoffsets */
+	header_size = sizeof(*hdr) + sizeof(*hdr_v5) + hdr_v5->hdr_nextension * 4;
+	/* Initialize crc */
+	filecrc = ptr_add(mmap, header_size);
+	if (!check_crc32(0, hdr, header_size, ntohl(*filecrc)))
+		return error("bad index file header crc signature");
+	return 0;
+}
+
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+						   struct directory_entry *de,
+						   char *name,
+						   size_t len,
+						   size_t prefix_len)
+{
+	struct cache_entry *ce = xmalloc(cache_entry_size(len + de->de_pathlen));
+	int flags;
+
+	flags = ntoh_s(ondisk->flags);
+	ce->ce_stat_data.sd_ctime.sec  = 0;
+	ce->ce_stat_data.sd_mtime.sec  = ntoh_l(ondisk->mtime.sec);
+	ce->ce_stat_data.sd_ctime.nsec = 0;
+	ce->ce_stat_data.sd_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
+	ce->ce_stat_data.sd_dev        = 0;
+	ce->ce_stat_data.sd_ino        = 0;
+	ce->ce_stat_data.sd_uid        = 0;
+	ce->ce_stat_data.sd_gid        = 0;
+	ce->ce_stat_data.sd_size       = ntoh_l(ondisk->size);
+	ce->ce_mode       = ntoh_s(ondisk->mode);
+	ce->ce_flags      = flags & CE_STAGEMASK;
+	ce->ce_flags     |= flags & CE_VALID;
+	ce->ce_flags     |= flags & CE_SMUDGED;
+	if (flags & CE_INTENT_TO_ADD_V5)
+		ce->ce_flags |= CE_INTENT_TO_ADD;
+	if (flags & CE_SKIP_WORKTREE_V5)
+		ce->ce_flags |= CE_SKIP_WORKTREE;
+	ce->ce_stat_crc   = ntoh_l(ondisk->stat_crc);
+	ce->ce_namelen    = len + de->de_pathlen;
+	hashcpy(ce->sha1, ondisk->sha1);
+	memcpy(ce->name, de->pathname, de->de_pathlen);
+	memcpy(ce->name + de->de_pathlen, name, len);
+	ce->name[len + de->de_pathlen] = '\0';
+	return ce;
+}
+
+static struct directory_entry *directory_entry_from_ondisk(struct ondisk_directory_entry *ondisk,
+						   const char *name,
+						   size_t len)
+{
+	struct directory_entry *de = xmalloc(directory_entry_size(len));
+
+	memcpy(de->pathname, name, len);
+	de->pathname[len] = '\0';
+	de->de_flags      = ntoh_s(ondisk->flags);
+	de->de_foffset    = ntoh_l(ondisk->foffset);
+	de->de_cr         = ntoh_l(ondisk->cr);
+	de->de_ncr        = ntoh_l(ondisk->ncr);
+	de->de_nsubtrees  = ntoh_l(ondisk->nsubtrees);
+	de->de_nfiles     = ntoh_l(ondisk->nfiles);
+	de->de_nentries   = ntoh_l(ondisk->nentries);
+	de->de_pathlen    = len;
+	hashcpy(de->sha1, ondisk->sha1);
+	return de;
+}
+
+static struct conflict_part *conflict_part_from_ondisk(struct ondisk_conflict_part *ondisk)
+{
+	struct conflict_part *cp = xmalloc(sizeof(struct conflict_part));
+
+	cp->flags      = ntoh_s(ondisk->flags);
+	cp->entry_mode = ntoh_s(ondisk->entry_mode);
+	hashcpy(cp->sha1, ondisk->sha1);
+	return cp;
+}
+
+static struct cache_entry *convert_conflict_part(struct conflict_part *cp,
+						char *name,
+						unsigned int len)
+{
+	struct cache_entry *ce = xmalloc(cache_entry_size(len));
+
+	ce->ce_stat_data.sd_ctime.sec  = 0;
+	ce->ce_stat_data.sd_mtime.sec  = 0;
+	ce->ce_stat_data.sd_ctime.nsec = 0;
+	ce->ce_stat_data.sd_mtime.nsec = 0;
+	ce->ce_stat_data.sd_dev        = 0;
+	ce->ce_stat_data.sd_ino        = 0;
+	ce->ce_stat_data.sd_uid        = 0;
+	ce->ce_stat_data.sd_gid        = 0;
+	ce->ce_stat_data.sd_size       = 0;
+	ce->ce_mode       = cp->entry_mode;
+	ce->ce_flags      = conflict_stage(cp) << CE_STAGESHIFT;
+	ce->ce_stat_crc   = 0;
+	ce->ce_namelen    = len;
+	hashcpy(ce->sha1, cp->sha1);
+	memcpy(ce->name, name, len);
+	ce->name[len] = '\0';
+	return ce;
+}
+
+static struct directory_entry *read_directories(unsigned int *dir_offset,
+				unsigned int *dir_table_offset,
+				void *mmap,
+				int mmap_size)
+{
+	int i, ondisk_directory_size;
+	uint32_t *filecrc, *beginning, *end;
+	struct directory_entry *current = NULL;
+	struct ondisk_directory_entry *disk_de;
+	struct directory_entry *de;
+	unsigned int data_len, len;
+	char *name;
+
+	/*
+	 * Length of pathname + nul byte for termination + size of
+	 * members of ondisk_directory_entry. (Just using the size
+	 * of the struct doesn't work, because there may be padding
+	 * bytes for the struct)
+	 */
+	ondisk_directory_size = sizeof(disk_de->flags)
+		+ sizeof(disk_de->foffset)
+		+ sizeof(disk_de->cr)
+		+ sizeof(disk_de->ncr)
+		+ sizeof(disk_de->nsubtrees)
+		+ sizeof(disk_de->nfiles)
+		+ sizeof(disk_de->nentries)
+		+ sizeof(disk_de->sha1);
+	name = ptr_add(mmap, *dir_offset);
+	beginning = ptr_add(mmap, *dir_table_offset);
+	end = ptr_add(mmap, *dir_table_offset + 4);
+	len = ntoh_l(*end) - ntoh_l(*beginning) - ondisk_directory_size - 5;
+	disk_de = ptr_add(mmap, *dir_offset + len + 1);
+	de = directory_entry_from_ondisk(disk_de, name, len);
+	de->next = NULL;
+
+	data_len = len + 1 + ondisk_directory_size;
+	filecrc = ptr_add(mmap, *dir_offset + data_len);
+	if (!check_crc32(0, ptr_add(mmap, *dir_offset), data_len, ntoh_l(*filecrc)))
+		goto unmap;
+
+	*dir_table_offset += 4;
+	*dir_offset += data_len + 4; /* crc code */
+
+	current = de;
+	for (i = 0; i < de->de_nsubtrees; i++) {
+		current->next = read_directories(dir_offset, dir_table_offset,
+						mmap, mmap_size);
+		while (current->next)
+			current = current->next;
+	}
+
+	return de;
+unmap:
+	munmap(mmap, mmap_size);
+	die("directory crc doesn't match for '%s'", de->pathname);
+}
+
+static int read_entry(struct cache_entry **ce, struct directory_entry *de,
+		      unsigned int *entry_offset,
+		      void **mmap, unsigned long mmap_size,
+		      unsigned int *foffsetblock)
+{
+	int len, offset_to_offset;
+	char *name;
+	uint32_t foffsetblockcrc;
+	uint32_t *filecrc, *beginning, *end;
+	struct ondisk_cache_entry *disk_ce;
+
+	name = ptr_add(*mmap, *entry_offset);
+	beginning = ptr_add(*mmap, *foffsetblock);
+	end = ptr_add(*mmap, *foffsetblock + 4);
+	len = ntoh_l(*end) - ntoh_l(*beginning) - sizeof(struct ondisk_cache_entry) - 5;
+	disk_ce = ptr_add(*mmap, *entry_offset + len + 1);
+	*ce = cache_entry_from_ondisk(disk_ce, de, name, len, de->de_pathlen);
+	filecrc = ptr_add(*mmap, *entry_offset + len + 1 + sizeof(*disk_ce));
+	offset_to_offset = htonl(*foffsetblock);
+	foffsetblockcrc = crc32(0, (Bytef*)&offset_to_offset, 4);
+	if (!check_crc32(foffsetblockcrc,
+		ptr_add(*mmap, *entry_offset), len + 1 + sizeof(*disk_ce),
+		ntoh_l(*filecrc)))
+		return -1;
+
+	*entry_offset += len + 1 + sizeof(*disk_ce) + 4;
+	return 0;
+}
+
+static void ce_queue_push(struct cache_entry **head,
+			     struct cache_entry **tail,
+			     struct cache_entry *ce)
+{
+	if (!*head) {
+		*head = *tail = ce;
+		(*tail)->next = NULL;
+		return;
+	}
+
+	(*tail)->next = ce;
+	ce->next = NULL;
+	*tail = (*tail)->next;
+}
+
+static void conflict_entry_push(struct conflict_entry **head,
+				struct conflict_entry **tail,
+				struct conflict_entry *conflict_entry)
+{
+	if (!*head) {
+		*head = *tail = conflict_entry;
+		(*tail)->next = NULL;
+		return;
+	}
+
+	(*tail)->next = conflict_entry;
+	conflict_entry->next = NULL;
+	*tail = (*tail)->next;
+}
+
+static struct cache_entry *ce_queue_pop(struct cache_entry **head)
+{
+	struct cache_entry *ce;
+
+	ce = *head;
+	*head = (*head)->next;
+	return ce;
+}
+
+static void conflict_part_head_remove(struct conflict_part **head)
+{
+	struct conflict_part *to_free;
+
+	to_free = *head;
+	*head = (*head)->next;
+	free(to_free);
+}
+
+static void conflict_entry_head_remove(struct conflict_entry **head)
+{
+	struct conflict_entry *to_free;
+
+	to_free = *head;
+	*head = (*head)->next;
+	free(to_free);
+}
+
+struct conflict_entry *create_new_conflict(char *name, int len, int pathlen)
+{
+	struct conflict_entry *conflict_entry;
+
+	if (pathlen)
+		pathlen++;
+	conflict_entry = xmalloc(conflict_entry_size(len));
+	conflict_entry->entries = NULL;
+	conflict_entry->nfileconflicts = 0;
+	conflict_entry->namelen = len;
+	memcpy(conflict_entry->name, name, len);
+	conflict_entry->name[len] = '\0';
+	conflict_entry->pathlen = pathlen;
+	conflict_entry->next = NULL;
+
+	return conflict_entry;
+}
+
+void add_part_to_conflict_entry(struct directory_entry *de,
+					struct conflict_entry *entry,
+					struct conflict_part *conflict_part)
+{
+	struct conflict_part *conflict_search;
+
+	entry->nfileconflicts++;
+	de->conflict_size += sizeof(struct ondisk_conflict_part);
+	if (!entry->entries)
+		entry->entries = conflict_part;
+	else {
+		conflict_search = entry->entries;
+		while (conflict_search->next)
+			conflict_search = conflict_search->next;
+		conflict_search->next = conflict_part;
+	}
+}
+
+static int read_conflicts(struct conflict_entry **head,
+			  struct directory_entry *de,
+			  void **mmap, unsigned long mmap_size)
+{
+	struct conflict_entry *tail;
+	unsigned int croffset, i;
+	char *full_name;
+
+	croffset = de->de_cr;
+	tail = NULL;
+	for (i = 0; i < de->de_ncr; i++) {
+		struct conflict_entry *conflict_new;
+		unsigned int len, *nfileconflicts;
+		char *name;
+		void *crc_start;
+		int k, offset;
+		uint32_t *filecrc;
+
+		offset = croffset;
+		crc_start = ptr_add(*mmap, offset);
+		name = ptr_add(*mmap, offset);
+		len = strlen(name);
+		offset += len + 1;
+		nfileconflicts = ptr_add(*mmap, offset);
+		offset += 4;
+
+		full_name = xmalloc(sizeof(char) * (len + de->de_pathlen));
+		memcpy(full_name, de->pathname, de->de_pathlen);
+		memcpy(full_name + de->de_pathlen, name, len);
+		conflict_new = create_new_conflict(full_name,
+				len + de->de_pathlen, de->de_pathlen);
+		for (k = 0; k < ntoh_l(*nfileconflicts); k++) {
+			struct ondisk_conflict_part *ondisk;
+			struct conflict_part *cp;
+
+			ondisk = ptr_add(*mmap, offset);
+			cp = conflict_part_from_ondisk(ondisk);
+			cp->next = NULL;
+			add_part_to_conflict_entry(de, conflict_new, cp);
+			offset += sizeof(struct ondisk_conflict_part);
+		}
+		filecrc = ptr_add(*mmap, offset);
+		free(full_name);
+		if (!check_crc32(0, crc_start,
+			len + 1 + 4 + conflict_new->nfileconflicts
+			* sizeof(struct ondisk_conflict_part),
+			ntoh_l(*filecrc)))
+			return -1;
+		croffset = offset + 4;
+		conflict_entry_push(head, &tail, conflict_new);
+	}
+	return 0;
+}
+
+static int read_entries(struct index_state *istate, struct directory_entry **de,
+			unsigned int *entry_offset, void **mmap,
+			unsigned long mmap_size, unsigned int *nr,
+			unsigned int *foffsetblock)
+{
+	struct cache_entry *head = NULL, *tail = NULL;
+	struct conflict_entry *conflict_queue;
+	struct cache_entry *ce;
+	int i;
+
+	conflict_queue = NULL;
+	if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
+		return -1;
+	for (i = 0; i < (*de)->de_nfiles; i++) {
+		if (read_entry(&ce,
+			       *de,
+			       entry_offset,
+			       mmap,
+			       mmap_size,
+			       foffsetblock) < 0)
+			return -1;
+		ce_queue_push(&head, &tail, ce);
+		*foffsetblock += 4;
+
+		/*
+		 * Add the conflicted entries at the end of the index file
+		 * to the in memory format
+		 */
+		if (conflict_queue &&
+		    (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
+		    !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
+					ce->name, ce_namelen(ce))) {
+			struct conflict_part *cp;
+			cp = conflict_queue->entries;
+			cp = cp->next;
+			while (cp) {
+				ce = convert_conflict_part(cp,
+						conflict_queue->name,
+						conflict_queue->namelen);
+				ce_queue_push(&head, &tail, ce);
+				conflict_part_head_remove(&cp);
+			}
+			conflict_entry_head_remove(&conflict_queue);
+		}
+	}
+
+	*de = (*de)->next;
+
+	while (head) {
+		if (*de != NULL
+		    && strcmp(head->name, (*de)->pathname) > 0) {
+			read_entries(istate,
+				     de,
+				     entry_offset,
+				     mmap,
+				     mmap_size,
+				     nr,
+				     foffsetblock);
+		} else {
+			ce = ce_queue_pop(&head);
+			set_index_entry(istate, *nr, ce);
+			(*nr)++;
+			ce->next = NULL;
+		}
+	}
+	return 0;
+}
+
+static struct directory_entry *read_head_directories(struct index_state *istate,
+						     unsigned int *entry_offset,
+						     unsigned int *foffsetblock,
+						     unsigned int *ndirs,
+						     void *mmap, unsigned long mmap_size)
+{
+	unsigned int dir_offset, dir_table_offset;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v5;
+	struct directory_entry *root_directory;
+
+	hdr = mmap;
+	hdr_v5 = ptr_add(mmap, sizeof(*hdr));
+	istate->version = ntohl(hdr->hdr_version);
+	istate->cache_alloc = alloc_nr(ntohl(hdr_v5->hdr_nfile));
+	istate->cache = xcalloc(istate->cache_alloc, sizeof(struct cache_entry *));
+	istate->initialized = 1;
+
+	/* Skip size of the header + crc sum + size of offsets */
+	dir_offset = sizeof(*hdr) + sizeof(*hdr_v5) + 4 + (ntohl(hdr_v5->hdr_ndir) + 1) * 4;
+	dir_table_offset = sizeof(*hdr) + sizeof(*hdr_v5) + 4;
+	root_directory = read_directories(&dir_offset, &dir_table_offset,
+					  mmap, mmap_size);
+
+	*entry_offset = ntohl(hdr_v5->hdr_fblockoffset);
+	*foffsetblock = dir_offset;
+	*ndirs = ntohl(hdr_v5->hdr_ndir);
+	return root_directory;
+}
+
+static int read_index_filtered_v5(struct index_state *istate, void *mmap,
+				  unsigned long mmap_size, struct filter_opts *opts)
+{
+	unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
+	struct directory_entry *root_directory, *de;
+	const char **adjusted_pathspec = NULL;
+	int need_root = 0, i, n;
+	char *oldpath, *seen;
+
+	root_directory = read_head_directories(istate, &entry_offset,
+					       &foffsetblock, &ndirs,
+					       mmap, mmap_size);
+
+	if (opts->pathspec->raw) {
+		need_root = 0;
+		seen = xcalloc(1, ndirs);
+		for (de = root_directory; de; de = de->next)
+			match_pathspec(opts->pathspec->raw, de->pathname, de->de_pathlen, 0, seen);
+		for (n = 0; opts->pathspec->raw[n]; n++)
+			/* just count */;
+		adjusted_pathspec = xmalloc((n+1)*sizeof(char *));
+		adjusted_pathspec[n] = NULL;
+		for (i = 0; i < n; i++) {
+			if (seen[i] == MATCHED_EXACTLY)
+				adjusted_pathspec[i] = opts->pathspec->raw[i];
+			else {
+				char *super = xstrdup(opts->pathspec->raw[i]);
+				int len = strlen(super);
+				while (len && super[len - 1] == '/')
+					super[--len] = '\0'; /* strip trailing / */
+				while (len && super[--len] != '/')
+					; /* scan backwards to next / */
+				if (len >= 0)
+					super[len--] = '\0';
+				if (len <= 0) {
+					need_root = 1;
+					break;
+				}
+				adjusted_pathspec[i] = super;
+			}
+		}
+	}
+
+	de = root_directory;
+	while (de) {
+		if (need_root ||
+		    match_pathspec(adjusted_pathspec, de->pathname, de->de_pathlen, 0, NULL)) {
+			unsigned int subdir_foffsetblock = de->de_foffset + foffsetblock;
+			unsigned int *off = mmap + subdir_foffsetblock;
+			unsigned int subdir_entry_offset = entry_offset + ntoh_l(*off);
+			oldpath = de->pathname;
+			do {
+				if (read_entries(istate, &de, &subdir_entry_offset,
+						 &mmap, mmap_size, &nr,
+						 &subdir_foffsetblock) < 0)
+					return -1;
+			} while (de && !prefixcmp(de->pathname, oldpath));
+		} else
+			de = de->next;
+	}
+	istate->cache_nr = nr;
+	return 0;
+}
+
+static int read_index_v5(struct index_state *istate, void *mmap,
+			 unsigned long mmap_size, struct filter_opts *opts)
+{
+	unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
+	struct directory_entry *root_directory, *de;
+
+	if (opts != NULL)
+		return read_index_filtered_v5(istate, mmap, mmap_size, opts);
+
+	root_directory = read_head_directories(istate, &entry_offset,
+					       &foffsetblock, &ndirs,
+					       mmap, mmap_size);
+	de = root_directory;
+	while (de)
+		if (read_entries(istate, &de, &entry_offset, &mmap,
+				 mmap_size, &nr, &foffsetblock) < 0)
+			return -1;
+	istate->cache_nr = nr;
+	return 0;
+}
+
+struct index_ops v5_ops = {
+	match_stat_basic,
+	verify_hdr,
+	read_index_v5,
+	NULL
+};
diff --git a/read-cache.h b/read-cache.h
index e72a585..869c562 100644
--- a/read-cache.h
+++ b/read-cache.h
@@ -35,6 +35,7 @@ struct internal_ops {
 
 extern struct index_ops v2_ops;
 extern struct internal_ops v2_internal_ops;
+extern struct index_ops v5_ops;
 
 #ifndef NEEDS_ALIGNED_ACCESS
 #define ntoh_s(var) ntohs(var)
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 13/19] read-cache: read resolve-undo data
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (11 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 12/19] read-cache: read index-v5 Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:26 ` [PATCH v2 14/19] read-cache: read cache-tree in index-v5 Thomas Gummerer
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Make git read the resolve-undo data from the index.

Since the resolve-undo data is stored together with the conflicts in
the on-disk format of the index file version 5, the conflicts and the
resolve-undo data are read at the same time, and the resolve-undo
data is then converted to the in-memory format.

Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache-v5.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/read-cache-v5.c b/read-cache-v5.c
index 00112ea..853b97d 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "read-cache.h"
+#include "string-list.h"
 #include "resolve-undo.h"
 #include "cache-tree.h"
 #include "dir.h"
@@ -447,6 +448,43 @@ static int read_conflicts(struct conflict_entry **head,
 	return 0;
 }
 
+static void resolve_undo_convert_v5(struct index_state *istate,
+				    struct conflict_entry *conflict)
+{
+	int i;
+
+	while (conflict) {
+		struct string_list_item *lost;
+		struct resolve_undo_info *ui;
+		struct conflict_part *cp;
+
+		if (conflict->entries &&
+		    (conflict->entries->flags & CONFLICT_CONFLICTED) != 0) {
+			conflict = conflict->next;
+			continue;
+		}
+		if (!istate->resolve_undo) {
+			istate->resolve_undo = xcalloc(1, sizeof(struct string_list));
+			istate->resolve_undo->strdup_strings = 1;
+		}
+
+		lost = string_list_insert(istate->resolve_undo, conflict->name);
+		if (!lost->util)
+			lost->util = xcalloc(1, sizeof(*ui));
+		ui = lost->util;
+
+		cp = conflict->entries;
+		for (i = 0; i < 3; i++)
+			ui->mode[i] = 0;
+		while (cp) {
+			ui->mode[conflict_stage(cp) - 1] = cp->entry_mode;
+			hashcpy(ui->sha1[conflict_stage(cp) - 1], cp->sha1);
+			cp = cp->next;
+		}
+		conflict = conflict->next;
+	}
+}
+
 static int read_entries(struct index_state *istate, struct directory_entry **de,
 			unsigned int *entry_offset, void **mmap,
 			unsigned long mmap_size, unsigned int *nr,
@@ -460,6 +498,7 @@ static int read_entries(struct index_state *istate, struct directory_entry **de,
 	conflict_queue = NULL;
 	if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
 		return -1;
+	resolve_undo_convert_v5(istate, conflict_queue);
 	for (i = 0; i < (*de)->de_nfiles; i++) {
 		if (read_entry(&ce,
 			       *de,
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 14/19] read-cache: read cache-tree in index-v5
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (12 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 13/19] read-cache: read resolve-undo data Thomas Gummerer
@ 2013-07-12 17:26 ` Thomas Gummerer
  2013-07-12 17:27 ` [PATCH v2 15/19] read-cache: write index-v5 Thomas Gummerer
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:26 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Since the cache-tree data is saved as part of the directory data,
it is already read at the beginning of the index, and only has to
be converted from that directory data.

The in-memory cache-tree data is arranged as a tree, with the
children at each node sorted by pathlen, while the on-disk format
is sorted lexically, so this tree has to be rebuilt from the
on-disk directory list.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache-tree.c    |   2 +-
 cache-tree.h    |   6 ++++
 read-cache-v5.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 37e4d00..f4b0917 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -31,7 +31,7 @@ void cache_tree_free(struct cache_tree **it_p)
 	*it_p = NULL;
 }
 
-static int subtree_name_cmp(const char *one, int onelen,
+int subtree_name_cmp(const char *one, int onelen,
 			    const char *two, int twolen)
 {
 	if (onelen < twolen)
diff --git a/cache-tree.h b/cache-tree.h
index 55d0f59..9aac493 100644
--- a/cache-tree.h
+++ b/cache-tree.h
@@ -21,10 +21,16 @@ struct cache_tree {
 	struct cache_tree_sub **down;
 };
 
+struct directory_queue {
+	struct directory_queue *down;
+	struct directory_entry *de;
+};
+
 struct cache_tree *cache_tree(void);
 void cache_tree_free(struct cache_tree **);
 void cache_tree_invalidate_path(struct cache_tree *, const char *);
 struct cache_tree_sub *cache_tree_sub(struct cache_tree *, const char *);
+int subtree_name_cmp(const char *, int, const char *, int);
 
 void cache_tree_write(struct strbuf *, struct cache_tree *root);
 struct cache_tree *cache_tree_read(const char *buffer, unsigned long size);
diff --git a/read-cache-v5.c b/read-cache-v5.c
index 853b97d..0b9c320 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -448,6 +448,103 @@ static int read_conflicts(struct conflict_entry **head,
 	return 0;
 }
 
+static struct cache_tree *convert_one(struct directory_queue *queue, int dirnr)
+{
+	int i, subtree_nr;
+	struct cache_tree *it;
+	struct directory_queue *down;
+
+	it = cache_tree();
+	it->entry_count = queue[dirnr].de->de_nentries;
+	subtree_nr = queue[dirnr].de->de_nsubtrees;
+	if (0 <= it->entry_count)
+		hashcpy(it->sha1, queue[dirnr].de->sha1);
+
+	/*
+	 * Just a heuristic -- we do not add directories that often but
+	 * we do not want to have to extend it immediately when we do,
+	 * hence +2.
+	 */
+	it->subtree_alloc = subtree_nr + 2;
+	it->down = xcalloc(it->subtree_alloc, sizeof(struct cache_tree_sub *));
+	down = queue[dirnr].down;
+	for (i = 0; i < subtree_nr; i++) {
+		struct cache_tree *sub;
+		struct cache_tree_sub *subtree;
+		char *buf, *name;
+
+		name = "";
+		buf = strtok(down[i].de->pathname, "/");
+		while (buf) {
+			name = buf;
+			buf = strtok(NULL, "/");
+		}
+		sub = convert_one(down, i);
+		if (!sub)
+			goto free_return;
+		subtree = cache_tree_sub(it, name);
+		subtree->cache_tree = sub;
+	}
+	if (subtree_nr != it->subtree_nr)
+		die("cache-tree: internal error");
+	return it;
+ free_return:
+	cache_tree_free(&it);
+	return NULL;
+}
+
+static int compare_cache_tree_elements(const void *a, const void *b)
+{
+	const struct directory_entry *de1, *de2;
+
+	de1 = ((const struct directory_queue *)a)->de;
+	de2 = ((const struct directory_queue *)b)->de;
+	return subtree_name_cmp(de1->pathname, de1->de_pathlen,
+				de2->pathname, de2->de_pathlen);
+}
+
+static struct directory_entry *sort_directories(struct directory_entry *de,
+						struct directory_queue *queue)
+{
+	int i, nsubtrees;
+
+	nsubtrees = de->de_nsubtrees;
+	for (i = 0; i < nsubtrees; i++) {
+		struct directory_entry *new_de;
+		de = de->next;
+		new_de = xmalloc(directory_entry_size(de->de_pathlen));
+		memcpy(new_de, de, directory_entry_size(de->de_pathlen));
+		queue[i].de = new_de;
+		if (de->de_nsubtrees) {
+			queue[i].down = xcalloc(de->de_nsubtrees,
+					sizeof(struct directory_queue));
+			de = sort_directories(de,
+					queue[i].down);
+		}
+	}
+	qsort(queue, nsubtrees, sizeof(struct directory_queue),
+			compare_cache_tree_elements);
+	return de;
+}
+
+/*
+ * This function modifies the directory argument that is given to it.
+ * Don't use it if the directory entries are still needed after.
+ */
+static struct cache_tree *cache_tree_convert_v5(struct directory_entry *de)
+{
+	struct directory_queue *queue;
+
+	if (!de->de_nentries)
+		return NULL;
+	queue = xcalloc(1, sizeof(struct directory_queue));
+	queue[0].de = de;
+	queue[0].down = xcalloc(de->de_nsubtrees, sizeof(struct directory_queue));
+
+	sort_directories(de, queue[0].down);
+	return convert_one(queue, 0);
+}
+
 static void resolve_undo_convert_v5(struct index_state *istate,
 				    struct conflict_entry *conflict)
 {
@@ -644,6 +741,7 @@ static int read_index_filtered_v5(struct index_state *istate, void *mmap,
 		} else
 			de = de->next;
 	}
+	istate->cache_tree = cache_tree_convert_v5(root_directory);
 	istate->cache_nr = nr;
 	return 0;
 }
@@ -665,6 +763,8 @@ static int read_index_v5(struct index_state *istate, void *mmap,
 		if (read_entries(istate, &de, &entry_offset, &mmap,
 				 mmap_size, &nr, &foffsetblock) < 0)
 			return -1;
+
+	istate->cache_tree = cache_tree_convert_v5(root_directory);
 	istate->cache_nr = nr;
 	return 0;
 }
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 15/19] read-cache: write index-v5
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (13 preceding siblings ...)
  2013-07-12 17:26 ` [PATCH v2 14/19] read-cache: read cache-tree in index-v5 Thomas Gummerer
@ 2013-07-12 17:27 ` Thomas Gummerer
  2013-07-12 17:27 ` [PATCH v2 16/19] read-cache: write index-v5 cache-tree data Thomas Gummerer
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:27 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Write the index version 5 file format to disk. This patch doesn't
yet write the cache-tree data and resolve-undo data to the file.

The main work is done when compiling the directory data from the
current in-memory format; the conflict data and the file data are
calculated in the same pass.

Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h         |   8 +
 read-cache-v5.c | 594 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 read-cache.c    |  11 +-
 read-cache.h    |   1 +
 4 files changed, 611 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index 1e5cc77..f3685f8 100644
--- a/cache.h
+++ b/cache.h
@@ -565,6 +565,7 @@ extern int unmerged_index(const struct index_state *);
 extern int verify_path(const char *path);
 extern struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int igncase);
 extern int index_name_pos(const struct index_state *, const char *name, int namelen);
+extern struct directory_entry *init_directory_entry(char *pathname, int len);
 #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
 #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
 #define ADD_CACHE_SKIP_DFCHECK 4	/* Ok to skip DF conflict checks */
@@ -1363,6 +1364,13 @@ static inline ssize_t write_str_in_full(int fd, const char *str)
 	return write_in_full(fd, str, strlen(str));
 }
 
+/* index-v5 helper functions */
+extern char *super_directory(const char *filename);
+extern void insert_directory_entry(struct directory_entry *, struct hash_table *, int *, unsigned int *, uint32_t);
+extern void add_conflict_to_directory_entry(struct directory_entry *, struct conflict_entry *);
+extern void add_part_to_conflict_entry(struct directory_entry *, struct conflict_entry *, struct conflict_part *);
+extern struct conflict_entry *create_new_conflict(char *, int, int);
+
 /* pager.c */
 extern void setup_pager(void);
 extern const char *pager_program;
diff --git a/read-cache-v5.c b/read-cache-v5.c
index 0b9c320..33667d7 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -769,9 +769,601 @@ static int read_index_v5(struct index_state *istate, void *mmap,
 	return 0;
 }
 
+#define WRITE_BUFFER_SIZE 8192
+static unsigned char write_buffer[WRITE_BUFFER_SIZE];
+static unsigned long write_buffer_len;
+
+static int ce_write_flush(int fd)
+{
+	unsigned int buffered = write_buffer_len;
+	if (buffered) {
+		if (write_in_full(fd, write_buffer, buffered) != buffered)
+			return -1;
+		write_buffer_len = 0;
+	}
+	return 0;
+}
+
+static int ce_write(uint32_t *crc, int fd, void *data, unsigned int len)
+{
+	if (crc)
+		*crc = crc32(*crc, (Bytef*)data, len);
+	while (len) {
+		unsigned int buffered = write_buffer_len;
+		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
+		if (partial > len)
+			partial = len;
+		memcpy(write_buffer + buffered, data, partial);
+		buffered += partial;
+		if (buffered == WRITE_BUFFER_SIZE) {
+			write_buffer_len = buffered;
+			if (ce_write_flush(fd))
+				return -1;
+			buffered = 0;
+		}
+		write_buffer_len = buffered;
+		len -= partial;
+		data = (char *) data + partial;
+	}
+	return 0;
+}
+
+static int ce_flush(int fd)
+{
+	unsigned int left = write_buffer_len;
+
+	if (left)
+		write_buffer_len = 0;
+
+	if (write_in_full(fd, write_buffer, left) != left)
+		return -1;
+
+	return 0;
+}
+
+static void ce_smudge_racily_clean_entry(struct cache_entry *ce)
+{
+	/*
+	 * This method shall only be called if the timestamp of ce
+	 * is racy (check with is_racy_timestamp). If the timestamp
+	 * is racy, the writer will set the CE_SMUDGED flag.
+	 *
+	 * The reader (match_stat_basic) then takes care of
+	 * checking whether the entry has really changed, by
+	 * examining the size and the stat_crc and, if those
+	 * haven't changed, by checking the sha1.
+	 */
+	ce->ce_flags |= CE_SMUDGED;
+}
+
+char *super_directory(const char *filename)
+{
+	char *slash;
+
+	slash = strrchr(filename, '/');
+	if (slash)
+		return xmemdupz(filename, slash-filename);
+	return NULL;
+}
+
+struct directory_entry *init_directory_entry(char *pathname, int len)
+{
+	struct directory_entry *de = xmalloc(directory_entry_size(len));
+
+	memcpy(de->pathname, pathname, len);
+	de->pathname[len] = '\0';
+	de->de_flags      = 0;
+	de->de_foffset    = 0;
+	de->de_cr         = 0;
+	de->de_ncr        = 0;
+	de->de_nsubtrees  = 0;
+	de->de_nfiles     = 0;
+	de->de_nentries   = 0;
+	memset(de->sha1, 0, 20);
+	de->de_pathlen    = len;
+	de->next          = NULL;
+	de->next_hash     = NULL;
+	de->ce            = NULL;
+	de->ce_last       = NULL;
+	de->conflict      = NULL;
+	de->conflict_last = NULL;
+	de->conflict_size = 0;
+	return de;
+}
+
+static void ondisk_from_directory_entry(struct directory_entry *de,
+					struct ondisk_directory_entry *ondisk)
+{
+	ondisk->foffset   = htonl(de->de_foffset);
+	ondisk->cr        = htonl(de->de_cr);
+	ondisk->ncr       = htonl(de->de_ncr);
+	ondisk->nsubtrees = htonl(de->de_nsubtrees);
+	ondisk->nfiles    = htonl(de->de_nfiles);
+	ondisk->nentries  = htonl(de->de_nentries);
+	hashcpy(ondisk->sha1, de->sha1);
+	ondisk->flags     = htons(de->de_flags);
+}
+
+static struct conflict_part *conflict_part_from_inmemory(struct cache_entry *ce)
+{
+	struct conflict_part *conflict;
+	int flags;
+
+	conflict = xmalloc(sizeof(struct conflict_part));
+	flags                = CONFLICT_CONFLICTED;
+	flags               |= ce_stage(ce) << CONFLICT_STAGESHIFT;
+	conflict->flags      = flags;
+	conflict->entry_mode = ce->ce_mode;
+	conflict->next       = NULL;
+	hashcpy(conflict->sha1, ce->sha1);
+	return conflict;
+}
+
+static void conflict_to_ondisk(struct conflict_part *cp,
+				struct ondisk_conflict_part *ondisk)
+{
+	ondisk->flags      = htons(cp->flags);
+	ondisk->entry_mode = htons(cp->entry_mode);
+	hashcpy(ondisk->sha1, cp->sha1);
+}
+
+void add_conflict_to_directory_entry(struct directory_entry *de,
+					struct conflict_entry *conflict_entry)
+{
+	de->de_ncr++;
+	de->conflict_size += conflict_entry->namelen + 1 + 8 - conflict_entry->pathlen;
+	conflict_entry_push(&de->conflict, &de->conflict_last, conflict_entry);
+}
+
+void insert_directory_entry(struct directory_entry *de,
+			struct hash_table *table,
+			int *total_dir_len,
+			unsigned int *ndir,
+			uint32_t crc)
+{
+	struct directory_entry *insert;
+
+	insert = (struct directory_entry *)insert_hash(crc, de, table);
+	if (insert) {
+		de->next_hash = insert->next_hash;
+		insert->next_hash = de;
+	}
+	(*ndir)++;
+	if (de->de_pathlen == 0)
+		(*total_dir_len)++;
+	else
+		*total_dir_len += de->de_pathlen + 2;
+}
+
+static struct conflict_entry *create_conflict_entry_from_ce(struct cache_entry *ce,
+								int pathlen)
+{
+	return create_new_conflict(ce->name, ce_namelen(ce), pathlen);
+}
+
+static struct directory_entry *compile_directory_data(struct index_state *istate,
+						int nfile,
+						unsigned int *ndir,
+						int *non_conflicted,
+						int *total_dir_len,
+						int *total_file_len)
+{
+	int i, dir_len = -1;
+	char *dir;
+	struct directory_entry *de, *current, *search, *found, *new, *previous_entry;
+	struct cache_entry **cache = istate->cache;
+	struct conflict_entry *conflict_entry;
+	struct hash_table table;
+	uint32_t crc;
+
+	init_hash(&table);
+	de = init_directory_entry("", 0);
+	current = de;
+	*ndir = 1;
+	*total_dir_len = 1;
+	crc = crc32(0, (Bytef*)de->pathname, de->de_pathlen);
+	insert_hash(crc, de, &table);
+	conflict_entry = NULL;
+	for (i = 0; i < nfile; i++) {
+		int new_entry;
+		if (cache[i]->ce_flags & CE_REMOVE)
+			continue;
+
+		new_entry = !ce_stage(cache[i]) || !conflict_entry
+		    || cache_name_compare(conflict_entry->name, conflict_entry->namelen,
+					cache[i]->name, ce_namelen(cache[i]));
+		if (new_entry)
+			(*non_conflicted)++;
+		if (dir_len < 0 || strncmp(cache[i]->name, dir, dir_len)
+		    || cache[i]->name[dir_len] != '/'
+		    || strchr(cache[i]->name + dir_len + 1, '/')) {
+			dir = super_directory(cache[i]->name);
+			if (!dir)
+				dir_len = 0;
+			else
+				dir_len = strlen(dir);
+			crc = crc32(0, (Bytef*)dir, dir_len);
+			found = lookup_hash(crc, &table);
+			search = found;
+			while (search && dir_len != 0 && strcmp(dir, search->pathname) != 0)
+				search = search->next_hash;
+		}
+		previous_entry = current;
+		if (!search || !found) {
+			new = init_directory_entry(dir, dir_len);
+			current->next = new;
+			current = current->next;
+			insert_directory_entry(new, &table, total_dir_len, ndir, crc);
+			search = current;
+		}
+		if (new_entry) {
+			search->de_nfiles++;
+			*total_file_len += ce_namelen(cache[i]) + 1;
+			if (search->de_pathlen)
+				*total_file_len -= search->de_pathlen + 1;
+			ce_queue_push(&(search->ce), &(search->ce_last), cache[i]);
+		}
+		if (ce_stage(cache[i]) > 0) {
+			struct conflict_part *conflict_part;
+			if (new_entry) {
+				conflict_entry = create_conflict_entry_from_ce(cache[i], search->de_pathlen);
+				add_conflict_to_directory_entry(search, conflict_entry);
+			}
+			conflict_part = conflict_part_from_inmemory(cache[i]);
+			add_part_to_conflict_entry(search, conflict_entry, conflict_part);
+		}
+		if (dir && !found) {
+			struct directory_entry *no_subtrees;
+
+			no_subtrees = current;
+			dir = super_directory(dir);
+			if (dir)
+				dir_len = strlen(dir);
+			else
+				dir_len = 0;
+			crc = crc32(0, (Bytef*)dir, dir_len);
+			found = lookup_hash(crc, &table);
+			while (!found) {
+				new = init_directory_entry(dir, dir_len);
+				new->de_nsubtrees = 1;
+				new->next = no_subtrees;
+				no_subtrees = new;
+				insert_directory_entry(new, &table, total_dir_len, ndir, crc);
+				dir = super_directory(dir);
+				if (!dir)
+					dir_len = 0;
+				else
+					dir_len = strlen(dir);
+				crc = crc32(0, (Bytef*)dir, dir_len);
+				found = lookup_hash(crc, &table);
+			}
+			search = found;
+			while (search->next_hash && strcmp(dir, search->pathname) != 0)
+				search = search->next_hash;
+			if (search)
+				found = search;
+			found->de_nsubtrees++;
+			previous_entry->next = no_subtrees;
+		}
+	}
+	return de;
+}
+
+static void ondisk_from_cache_entry(struct cache_entry *ce,
+				    struct ondisk_cache_entry *ondisk)
+{
+	unsigned int flags;
+
+	flags  = ce->ce_flags & CE_STAGEMASK;
+	flags |= ce->ce_flags & CE_VALID;
+	flags |= ce->ce_flags & CE_SMUDGED;
+	if (ce->ce_flags & CE_INTENT_TO_ADD)
+		flags |= CE_INTENT_TO_ADD_V5;
+	if (ce->ce_flags & CE_SKIP_WORKTREE)
+		flags |= CE_SKIP_WORKTREE_V5;
+	ondisk->flags      = htons(flags);
+	ondisk->mode       = htons(ce->ce_mode);
+	ondisk->mtime.sec  = htonl(ce->ce_stat_data.sd_mtime.sec);
+#ifdef USE_NSEC
+	ondisk->mtime.nsec = htonl(ce->ce_stat_data.sd_mtime.nsec);
+#else
+	ondisk->mtime.nsec = 0;
+#endif
+	ondisk->size       = htonl(ce->ce_stat_data.sd_size);
+	if (!ce->ce_stat_crc)
+		ce->ce_stat_crc = calculate_stat_crc(ce);
+	ondisk->stat_crc   = htonl(ce->ce_stat_crc);
+	hashcpy(ondisk->sha1, ce->sha1);
+}
+
+static int write_directories(struct directory_entry *de, int fd, int conflict_offset)
+{
+	struct directory_entry *current;
+	struct ondisk_directory_entry ondisk;
+	int current_offset, offset_write, ondisk_size, foffset;
+	uint32_t crc;
+
+	/*
+	 * This is needed because the compiler pads structs to sizes
+	 * that are multiples of 4
+	 */
+	ondisk_size = sizeof(ondisk.flags)
+		+ sizeof(ondisk.foffset)
+		+ sizeof(ondisk.cr)
+		+ sizeof(ondisk.ncr)
+		+ sizeof(ondisk.nsubtrees)
+		+ sizeof(ondisk.nfiles)
+		+ sizeof(ondisk.nentries)
+		+ sizeof(ondisk.sha1);
+	current = de;
+	current_offset = 0;
+	foffset = 0;
+	while (current) {
+		int pathlen;
+
+		offset_write = htonl(current_offset);
+		if (ce_write(NULL, fd, &offset_write, 4) < 0)
+			return -1;
+		if (current->de_pathlen == 0)
+			pathlen = 0;
+		else
+			pathlen = current->de_pathlen + 1;
+		current_offset += pathlen + 1 + ondisk_size + 4;
+		current = current->next;
+	}
+	/*
+	 * Write one more offset, which points to the end of the entries,
+	 * because we use it for calculating the dir length, instead of
+	 * using strlen.
+	 */
+	offset_write = htonl(current_offset);
+	if (ce_write(NULL, fd, &offset_write, 4) < 0)
+		return -1;
+	current = de;
+	while (current) {
+		crc = 0;
+		if (current->de_pathlen == 0) {
+			if (ce_write(&crc, fd, current->pathname, 1) < 0)
+				return -1;
+		} else {
+			char *path = xmalloc(current->de_pathlen + 2);
+			memcpy(path, current->pathname, current->de_pathlen);
+			memcpy(path + current->de_pathlen, "/\0", 2);
+			if (ce_write(&crc, fd, path, current->de_pathlen + 2) < 0)
+				return -1;
+			free(path);
+		}
+		current->de_foffset = foffset;
+		current->de_cr = conflict_offset;
+		ondisk_from_directory_entry(current, &ondisk);
+		if (ce_write(&crc, fd, &ondisk, ondisk_size) < 0)
+			return -1;
+		crc = htonl(crc);
+		if (ce_write(NULL, fd, &crc, 4) < 0)
+			return -1;
+		conflict_offset += current->conflict_size;
+		foffset += current->de_nfiles * 4;
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_entries(struct index_state *istate,
+			    struct directory_entry *de,
+			    int entries,
+			    int fd,
+			    int offset_to_offset)
+{
+	int offset, offset_write, ondisk_size;
+	struct directory_entry *current;
+
+	offset = 0;
+	ondisk_size = sizeof(struct ondisk_cache_entry);
+	current = de;
+	while (current) {
+		int pathlen;
+		struct cache_entry *ce = current->ce;
+
+		if (current->de_pathlen == 0)
+			pathlen = 0;
+		else
+			pathlen = current->de_pathlen + 1;
+		while (ce) {
+			if (ce->ce_flags & CE_REMOVE) {
+				ce = ce->next;
+				continue;
+			}
+			if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
+				ce_smudge_racily_clean_entry(ce);
+			if (is_null_sha1(ce->sha1))
+				return error("cache entry has null sha1: %s", ce->name);
+
+			offset_write = htonl(offset);
+			if (ce_write(NULL, fd, &offset_write, 4) < 0)
+				return -1;
+			offset += ce_namelen(ce) - pathlen + 1 + ondisk_size + 4;
+			ce = ce->next;
+		}
+		current = current->next;
+	}
+	/*
+	 * Write one more offset, which points to the end of the entries,
+	 * because we use it for calculating the file length, instead of
+	 * using strlen.
+	 */
+	offset_write = htonl(offset);
+	if (ce_write(NULL, fd, &offset_write, 4) < 0)
+		return -1;
+
+	offset = offset_to_offset;
+	current = de;
+	while (current) {
+		int pathlen;
+		struct cache_entry *ce = current->ce;
+
+		if (current->de_pathlen == 0)
+			pathlen = 0;
+		else
+			pathlen = current->de_pathlen + 1;
+		while (ce) {
+			struct ondisk_cache_entry ondisk;
+			uint32_t crc, calc_crc;
+
+			if (ce->ce_flags & CE_REMOVE) {
+				ce = ce->next;
+				continue;
+			}
+			calc_crc = htonl(offset);
+			crc = crc32(0, (Bytef*)&calc_crc, 4);
+			if (ce_write(&crc, fd, ce->name + pathlen,
+					ce_namelen(ce) - pathlen + 1) < 0)
+				return -1;
+			ondisk_from_cache_entry(ce, &ondisk);
+			if (ce_write(&crc, fd, &ondisk, ondisk_size) < 0)
+				return -1;
+			crc = htonl(crc);
+			if (ce_write(NULL, fd, &crc, 4) < 0)
+				return -1;
+			offset += 4;
+			ce = ce->next;
+		}
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_conflict(struct conflict_entry *conflict, int fd)
+{
+	struct conflict_entry *current;
+	struct conflict_part *current_part;
+	uint32_t crc;
+
+	current = conflict;
+	while (current) {
+		unsigned int to_write;
+
+		crc = 0;
+		if (ce_write(&crc, fd,
+		     (Bytef*)(current->name + current->pathlen),
+		     current->namelen - current->pathlen) < 0)
+			return -1;
+		if (ce_write(&crc, fd, (Bytef*)"\0", 1) < 0)
+			return -1;
+		to_write = htonl(current->nfileconflicts);
+		if (ce_write(&crc, fd, (Bytef*)&to_write, 4) < 0)
+			return -1;
+		current_part = current->entries;
+		while (current_part) {
+			struct ondisk_conflict_part ondisk;
+
+			conflict_to_ondisk(current_part, &ondisk);
+			if (ce_write(&crc, fd, (Bytef*)&ondisk, sizeof(struct ondisk_conflict_part)) < 0)
+				return -1;
+			current_part = current_part->next;
+		}
+		to_write = htonl(crc);
+		if (ce_write(NULL, fd, (Bytef*)&to_write, 4) < 0)
+			return -1;
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_conflicts(struct index_state *istate,
+			      struct directory_entry *de,
+			      int fd)
+{
+	struct directory_entry *current;
+
+	current = de;
+	while (current) {
+		if (current->de_ncr != 0) {
+			if (write_conflict(current->conflict, fd) < 0)
+				return -1;
+		}
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_index_v5(struct index_state *istate, int newfd)
+{
+	struct cache_version_header hdr;
+	struct cache_header hdr_v5;
+	struct cache_entry **cache = istate->cache;
+	struct directory_entry *de;
+	struct ondisk_directory_entry *ondisk;
+	int entries = istate->cache_nr;
+	int i, removed, non_conflicted, total_dir_len, ondisk_directory_size;
+	int total_file_len, conflict_offset, offset_to_offset;
+	unsigned int ndir;
+	uint32_t crc;
+
+	if (istate->filter_opts)
+		die("BUG: index: cannot write a partially read index");
+
+	for (i = removed = 0; i < entries; i++) {
+		if (cache[i]->ce_flags & CE_REMOVE)
+			removed++;
+	}
+	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
+	hdr.hdr_version = htonl(istate->version);
+	hdr_v5.hdr_nfile = htonl(entries - removed);
+	hdr_v5.hdr_nextension = htonl(0); /* Currently no extensions are supported */
+
+	non_conflicted = 0;
+	total_dir_len = 0;
+	total_file_len = 0;
+	de = compile_directory_data(istate, entries, &ndir, &non_conflicted,
+			&total_dir_len, &total_file_len);
+	hdr_v5.hdr_ndir = htonl(ndir);
+
+	/*
+	 * This is needed because the compiler pads structs to sizes
+	 * that are multiples of 4
+	 */
+	ondisk_directory_size = sizeof(ondisk->flags)
+		+ sizeof(ondisk->foffset)
+		+ sizeof(ondisk->cr)
+		+ sizeof(ondisk->ncr)
+		+ sizeof(ondisk->nsubtrees)
+		+ sizeof(ondisk->nfiles)
+		+ sizeof(ondisk->nentries)
+		+ sizeof(ondisk->sha1);
+	hdr_v5.hdr_fblockoffset = htonl(sizeof(hdr) + sizeof(hdr_v5) + 4
+		+ (ndir + 1) * 4
+		+ total_dir_len
+		+ ndir * (ondisk_directory_size + 4)
+		+ (non_conflicted + 1) * 4);
+
+	crc = 0;
+	if (ce_write(&crc, newfd, &hdr, sizeof(hdr)) < 0)
+		return -1;
+	if (ce_write(&crc, newfd, &hdr_v5, sizeof(hdr_v5)) < 0)
+		return -1;
+	crc = htonl(crc);
+	if (ce_write(NULL, newfd, &crc, 4) < 0)
+		return -1;
+
+	conflict_offset = sizeof(hdr) + sizeof(hdr_v5) + 4
+		+ (ndir + 1) * 4
+		+ total_dir_len
+		+ ndir * (ondisk_directory_size + 4)
+		+ (non_conflicted + 1) * 4
+		+ total_file_len
+		+ non_conflicted * (sizeof(struct ondisk_cache_entry) + 4);
+	if (write_directories(de, newfd, conflict_offset) < 0)
+		return -1;
+	offset_to_offset = sizeof(hdr) + sizeof(hdr_v5) + 4
+		+ (ndir + 1) * 4
+		+ total_dir_len
+		+ ndir * (ondisk_directory_size + 4);
+	if (write_entries(istate, de, entries, newfd, offset_to_offset) < 0)
+		return -1;
+	if (write_conflicts(istate, de, newfd) < 0)
+		return -1;
+	return ce_flush(newfd);
+}
+
 struct index_ops v5_ops = {
 	match_stat_basic,
 	verify_hdr,
 	read_index_v5,
-	NULL
+	write_index_v5
 };
diff --git a/read-cache.c b/read-cache.c
index 9bfbb4f..2fcca7f 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -108,7 +108,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
 	return changed;
 }
 
-static uint32_t calculate_stat_crc(struct cache_entry *ce)
+uint32_t calculate_stat_crc(struct cache_entry *ce)
 {
 	unsigned int ctimens = 0;
 	uint32_t stat, stat_crc;
@@ -232,6 +232,8 @@ static void set_istate_ops(struct index_state *istate)
 
 	if (istate->version >= 2 && istate->version <= 4)
 		istate->ops = &v2_ops;
+	if (istate->version == 5)
+		istate->ops = &v5_ops;
 }
 
 int ce_match_stat_basic(struct index_state *istate,
@@ -1300,7 +1302,12 @@ static int verify_hdr_version(struct index_state *istate,
 	hdr_version = ntohl(hdr->hdr_version);
 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
 		return error("bad index version %d", hdr_version);
-	istate->ops = &v2_ops;
+
+	if (hdr_version >= 2 && hdr_version <= 4)
+		istate->ops = &v2_ops;
+	else if (hdr_version == 5)
+		istate->ops = &v5_ops;
+
 	return 0;
 }
 
diff --git a/read-cache.h b/read-cache.h
index 869c562..0cb3b8b 100644
--- a/read-cache.h
+++ b/read-cache.h
@@ -62,3 +62,4 @@ extern int ce_match_stat_basic(struct index_state *istate, const struct cache_en
 			       struct stat *st);
 extern int is_racy_timestamp(const struct index_state *istate, const struct cache_entry *ce);
 extern void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce);
+extern uint32_t calculate_stat_crc(struct cache_entry *ce);
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 16/19] read-cache: write index-v5 cache-tree data
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (14 preceding siblings ...)
  2013-07-12 17:27 ` [PATCH v2 15/19] read-cache: write index-v5 Thomas Gummerer
@ 2013-07-12 17:27 ` Thomas Gummerer
  2013-07-12 17:27 ` [PATCH v2 17/19] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:27 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Write the cache-tree data for the index version 5 file format. The
in-memory cache-tree data is converted to the on-disk format by adding
it to the directory entries that were compiled from the cache entries
in the previous step.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache-v5.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/read-cache-v5.c b/read-cache-v5.c
index 33667d7..cd819b4 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -941,6 +941,57 @@ static struct conflict_entry *create_conflict_entry_from_ce(struct cache_entry *
 	return create_new_conflict(ce->name, ce_namelen(ce), pathlen);
 }
 
+static void convert_one_to_ondisk_v5(struct hash_table *table, struct cache_tree *it,
+				const char *path, int pathlen, uint32_t crc)
+{
+	int i;
+	struct directory_entry *found, *search;
+
+	crc = crc32(crc, (Bytef*)path, pathlen);
+	found = lookup_hash(crc, table);
+	search = found;
+	while (search && strcmp(path, search->pathname + search->de_pathlen - strlen(path)) != 0)
+		search = search->next_hash;
+	if (!search)
+		return;
+	/*
+	 * The number of subtrees is already calculated by
+	 * compile_directory_data, therefore we only need to
+	 * add the entry_count
+	 */
+	search->de_nentries = it->entry_count;
+	if (0 <= it->entry_count)
+		hashcpy(search->sha1, it->sha1);
+	if (strcmp(path, "") != 0)
+		crc = crc32(crc, (Bytef*)"/", 1);
+
+#if DEBUG
+	if (0 <= it->entry_count)
+		fprintf(stderr, "cache-tree <%.*s> (%d ent, %d subtree) %s\n",
+			pathlen, path, it->entry_count, it->subtree_nr,
+			sha1_to_hex(it->sha1));
+	else
+		fprintf(stderr, "cache-tree <%.*s> (%d subtree) invalid\n",
+			pathlen, path, it->subtree_nr);
+#endif
+
+	for (i = 0; i < it->subtree_nr; i++) {
+		struct cache_tree_sub *down = it->down[i];
+		if (i) {
+			struct cache_tree_sub *prev = it->down[i-1];
+			if (subtree_name_cmp(down->name, down->namelen,
+					     prev->name, prev->namelen) <= 0)
+				die("fatal - unsorted cache subtree");
+		}
+		convert_one_to_ondisk_v5(table, down->cache_tree, down->name, down->namelen, crc);
+	}
+}
+
+static void cache_tree_to_ondisk_v5(struct hash_table *table, struct cache_tree *root)
+{
+	convert_one_to_ondisk_v5(table, root, "", 0, 0);
+}
+
 static struct directory_entry *compile_directory_data(struct index_state *istate,
 						int nfile,
 						unsigned int *ndir,
@@ -1046,6 +1097,8 @@ static struct directory_entry *compile_directory_data(struct index_state *istate
 			previous_entry->next = no_subtrees;
 		}
 	}
+	if (istate->cache_tree)
+		cache_tree_to_ondisk_v5(&table, istate->cache_tree);
 	return de;
 }
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 17/19] read-cache: write resolve-undo data for index-v5
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (15 preceding siblings ...)
  2013-07-12 17:27 ` [PATCH v2 16/19] read-cache: write index-v5 cache-tree data Thomas Gummerer
@ 2013-07-12 17:27 ` Thomas Gummerer
  2013-07-12 17:27 ` [PATCH v2 18/19] update-index.c: rewrite index when index-version is given Thomas Gummerer
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:27 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Make git write the resolve-undo data to the index.

Since the resolve-undo data is joined with the conflicts in
the ondisk format of the index file version 5, the in-memory
resolve-undo data is converted to the on-disk conflict format
and written out together with the conflicts.

Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache-v5.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/read-cache-v5.c b/read-cache-v5.c
index cd819b4..093ee1a 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -992,6 +992,99 @@ static void cache_tree_to_ondisk_v5(struct hash_table *table, struct cache_tree
 	convert_one_to_ondisk_v5(table, root, "", 0, 0);
 }
 
+static void resolve_undo_to_ondisk_v5(struct hash_table *table,
+				      struct string_list *resolve_undo,
+				      unsigned int *ndir, int *total_dir_len,
+				      struct directory_entry *de)
+{
+	struct string_list_item *item;
+	struct directory_entry *search;
+
+	if (!resolve_undo)
+		return;
+	for_each_string_list_item(item, resolve_undo) {
+		struct conflict_entry *conflict_entry;
+		struct resolve_undo_info *ui = item->util;
+		char *super;
+		int i, dir_len, len;
+		uint32_t crc;
+		struct directory_entry *found, *current, *new_tree;
+
+		if (!ui)
+			continue;
+
+		super = super_directory(item->string);
+		if (!super)
+			dir_len = 0;
+		else
+			dir_len = strlen(super);
+		crc = crc32(0, (Bytef*)super, dir_len);
+		found = lookup_hash(crc, table);
+		current = NULL;
+		new_tree = NULL;
+
+		while (!found) {
+			struct directory_entry *new;
+
+			new = init_directory_entry(super, dir_len);
+			if (!current)
+				current = new;
+			insert_directory_entry(new, table, total_dir_len, ndir, crc);
+			if (new_tree != NULL)
+				new->de_nsubtrees = 1;
+			new->next = new_tree;
+			new_tree = new;
+			super = super_directory(super);
+			if (!super)
+				dir_len = 0;
+			else
+				dir_len = strlen(super);
+			crc = crc32(0, (Bytef*)super, dir_len);
+			found = lookup_hash(crc, table);
+		}
+		search = found;
+		while (search->next_hash && strcmp(super, search->pathname) != 0)
+			search = search->next_hash;
+		if (search && !current)
+			current = search;
+		if (!search && !current)
+			current = new_tree;
+		if (!super && new_tree) {
+			new_tree->next = de->next;
+			de->next = new_tree;
+			de->de_nsubtrees++;
+		} else if (new_tree) {
+			struct directory_entry *temp;
+
+			search = de->next;
+			while (strcmp(super, search->pathname))
+				search = search->next;
+			temp = new_tree;
+			while (temp->next)
+				temp = temp->next;
+			search->de_nsubtrees++;
+			temp->next = search->next;
+			search->next = new_tree;
+		}
+
+		len = strlen(item->string);
+		conflict_entry = create_new_conflict(item->string, len, current->de_pathlen);
+		add_conflict_to_directory_entry(current, conflict_entry);
+		for (i = 0; i < 3; i++) {
+			if (ui->mode[i]) {
+				struct conflict_part *cp;
+
+				cp = xmalloc(sizeof(struct conflict_part));
+				cp->flags = (i + 1) << CONFLICT_STAGESHIFT;
+				cp->entry_mode = ui->mode[i];
+				cp->next = NULL;
+				hashcpy(cp->sha1, ui->sha1[i]);
+				add_part_to_conflict_entry(current, conflict_entry, cp);
+			}
+		}
+	}
+}
+
 static struct directory_entry *compile_directory_data(struct index_state *istate,
 						int nfile,
 						unsigned int *ndir,
@@ -1099,6 +1192,7 @@ static struct directory_entry *compile_directory_data(struct index_state *istate
 	}
 	if (istate->cache_tree)
 		cache_tree_to_ondisk_v5(&table, istate->cache_tree);
+	resolve_undo_to_ondisk_v5(&table, istate->resolve_undo, ndir, total_dir_len, de);
 	return de;
 }
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 18/19] update-index.c: rewrite index when index-version is given
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (16 preceding siblings ...)
  2013-07-12 17:27 ` [PATCH v2 17/19] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
@ 2013-07-12 17:27 ` Thomas Gummerer
  2013-07-12 17:27 ` [PATCH v2 19/19] p0003-index.sh: add perf test for the index formats Thomas Gummerer
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:27 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

Make update-index always rewrite the index when an index-version
is given, even if the index already has the right version.
This option is used for performance testing the reader and
the writer.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/update-index.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 4c6e3a6..7e723c0 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -6,6 +6,7 @@
 #include "cache.h"
 #include "quote.h"
 #include "cache-tree.h"
+#include "read-cache.h"
 #include "tree-walk.h"
 #include "builtin.h"
 #include "refs.h"
@@ -863,8 +864,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 			    preferred_index_format,
 			    INDEX_FORMAT_LB, INDEX_FORMAT_UB);
 
-		if (the_index.version != preferred_index_format)
-			active_cache_changed = 1;
+		active_cache_changed = 1;
 		the_index.version = preferred_index_format;
 	}
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 19/19] p0003-index.sh: add perf test for the index formats
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (17 preceding siblings ...)
  2013-07-12 17:27 ` [PATCH v2 18/19] update-index.c: rewrite index when index-version is given Thomas Gummerer
@ 2013-07-12 17:27 ` Thomas Gummerer
  2013-07-14  2:59 ` [PATCH v2 00/19] Index-v5 Duy Nguyen
  2013-07-16 21:03 ` Ramsay Jones
  20 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-12 17:27 UTC (permalink / raw)
  To: git; +Cc: t.gummerer, trast, mhagger, gitster, pclouds, robin.rosenberg,
	sunshine

From: Thomas Rast <trast@inf.ethz.ch>

Add a performance test for index versions [23]/4/5 by using
git update-index --index-version=x, thus testing the speed of
both the reader and the writer for all index formats.

Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 t/perf/p0003-index.sh | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100755 t/perf/p0003-index.sh

diff --git a/t/perf/p0003-index.sh b/t/perf/p0003-index.sh
new file mode 100755
index 0000000..3e02868
--- /dev/null
+++ b/t/perf/p0003-index.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description="Tests index versions [23]/4/5"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_expect_success "convert to v2" "
+	git update-index --index-version=2
+"
+
+test_perf "v[23]: update-index" "
+	git update-index --index-version=2 >/dev/null
+"
+
+subdir=$(git ls-files | sed 's#/[^/]*$##' | grep -v '^$' | uniq | tail -n 30 | head -1)
+
+test_perf "v[23]: grep nonexistent -- subdir" "
+	test_must_fail git grep nonexistent -- $subdir >/dev/null
+"
+
+test_perf "v[23]: ls-files -- subdir" "
+	git ls-files $subdir >/dev/null
+"
+
+test_expect_success "convert to v4" "
+	git update-index --index-version=4
+"
+
+test_perf "v4: update-index" "
+	git update-index --index-version=4 >/dev/null
+"
+
+test_perf "v4: grep nonexistent -- subdir" "
+	test_must_fail git grep nonexistent -- $subdir >/dev/null
+"
+
+test_perf "v4: ls-files -- subdir" "
+	git ls-files $subdir >/dev/null
+"
+
+test_expect_success "convert to v5" "
+	git update-index --index-version=5
+"
+
+test_perf "v5: update-index" "
+	git update-index --index-version=5 >/dev/null
+"
+
+test_perf "v5: grep nonexistent -- subdir" "
+	test_must_fail git grep nonexistent -- $subdir >/dev/null
+"
+
+test_perf "v5: ls-files -- subdir" "
+	git ls-files $subdir >/dev/null
+"
+
+test_done
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 00/19] Index-v5
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (18 preceding siblings ...)
  2013-07-12 17:27 ` [PATCH v2 19/19] p0003-index.sh: add perf test for the index formats Thomas Gummerer
@ 2013-07-14  2:59 ` Duy Nguyen
  2013-07-15  9:30   ` Thomas Gummerer
  2013-07-16 21:03 ` Ramsay Jones
  20 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  2:59 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>  t/perf/p0003-index.sh                            |   59 +
>  t/t2104-update-index-skip-worktree.sh            |    1 +

For such a big code addition, the test part seems modest. Speaking
from my experience, I rarely run perf tests and "make test" does not
activate v5 code at all. A few more tests would be nice. The good news
is I changed default index version to 5 and ran "make test", all
passed.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file
  2013-07-12 17:26 ` [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file Thomas Gummerer
@ 2013-07-14  3:10   ` Duy Nguyen
  2013-07-19 14:53     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  3:10 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> @@ -489,8 +479,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig
>  #define CE_MATCH_RACY_IS_DIRTY         02
>  /* do stat comparison even if CE_SKIP_WORKTREE is true */
>  #define CE_MATCH_IGNORE_SKIP_WORKTREE  04
> -extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
> -extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
> +extern int ie_match_stat(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
> +extern int ie_modified(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
>

I would rather we keep "const struct index_state*" if we could. I
tried putting "const" back and found that ce_match_stat_basic() calls
set_istate_ops(), which writes to "struct index_state". Putting
set_istate_ops() in ce_match_stat_basic() may seem convenient, but
does not make much sense (why would a match_stat function update
index_ops?). I think you could move it out and

 - read_index calls set_istate_ops
 - (bonus) discard_index probably should reset "version" field to zero
and clear internal_ops
 - the callers that use index without read_index must call
initialize_index() or something, which in turn calls set_istate_ops.
initialize_index() may take the preferred index version
 - do not let update-index modify the version field directly when
--index-version is given. Wrap it with set_index_version() or
something, so we can do internal conversion from one version to
another if needed
 - remove set_istate_ops in write_index(), we may need internal_ops
long before writing. When write_index is called, internal_ops should
already be initialized
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 06/19] read-cache: add index reading api
  2013-07-12 17:26 ` [PATCH v2 06/19] read-cache: add index reading api Thomas Gummerer
@ 2013-07-14  3:21   ` Duy Nguyen
  0 siblings, 0 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  3:21 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> @@ -238,7 +245,6 @@ static int read_index_v2(struct index_state *istate, void *mmap,
>                 disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
>                 ce = create_from_disk(disk_ce, &consumed, previous_name);
>                 set_index_entry(istate, i, ce);
> -
>                 src_offset += consumed;
>         }
>         strbuf_release(&previous_name_buf);

Nit pick. This is unnecessary.

> +int for_each_index_entry_v2(struct index_state *istate, each_cache_entry_fn fn, void *cb_data)
> +{
> +       int i, ret = 0;
> +       struct filter_opts *opts= istate->filter_opts;

Nitpick. space before "=".

> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -996,6 +996,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
>                 memmove(istate->cache + pos + 1,
>                         istate->cache + pos,
>                         (istate->cache_nr - pos - 1) * sizeof(ce));
> +
>         set_index_entry(istate, pos, ce);
>         istate->cache_changed = 1;
>         return 0;

NP: unnecessary change.

> +int set_internal_ops(struct index_state *istate)
> +{
> +       if (!istate->internal_ops && istate->cache)
> +               istate->internal_ops = &v2_internal_ops;
> +       if (!istate->internal_ops)
> +               return 0;
> +       return 1;
> +}
> +
> +/*
> + * Execute fn for each index entry which is currently in istate.  Data
> + * can be given to the function using the cb_data parameter.
> + */
> +int for_each_index_entry(struct index_state *istate, each_cache_entry_fn fn, void *cb_data)
> +{
> +       if (!set_internal_ops(istate))
> +               return 0;
> +       return istate->internal_ops->for_each_index_entry(istate, fn, cb_data);
>  }

set_internal_ops should have been called in read_index and initialize_index.

> @@ -1374,6 +1409,7 @@ int discard_index(struct index_state *istate)
>         free(istate->cache);
>         istate->cache = NULL;
>         istate->cache_alloc = 0;
> +       istate->filter_opts = NULL;
>         return 0;
>  }

Why is internal_ops not reset here?

> --- a/read-cache.h
> +++ b/read-cache.h
> @@ -24,11 +24,17 @@ struct cache_version_header {
>  struct index_ops {
>         int (*match_stat_basic)(const struct cache_entry *ce, struct stat *st, int changed);
>         int (*verify_hdr)(void *mmap, unsigned long size);
> -       int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size);
> +       int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size,
> +                         struct filter_opts *opts);
>         int (*write_index)(struct index_state *istate, int newfd);
>  };
>
> +struct internal_ops {
> +       int (*for_each_index_entry)(struct index_state *istate, each_cache_entry_fn fn, void *cb_data);
> +};
> +

I wonder if we do need separate internal_ops from index_ops. Why not
merge internal_ops in index_ops?
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 07/19] make sure partially read index is not changed
  2013-07-12 17:26 ` [PATCH v2 07/19] make sure partially read index is not changed Thomas Gummerer
@ 2013-07-14  3:29   ` Duy Nguyen
  2013-07-17 12:56     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  3:29 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> A partially read index file currently cannot be written to disk.  Make
> sure that never happens, by erroring out when a caller tries to change a
> partially read index.  The caller is responsible for reading the whole
> index when it's trying to change it later.
>
> Forcing the caller to load the right part of the index file instead of
> re-reading it when changing it, gives a bit of a performance advantage,
> by avoiding to read parts of the index twice.

If you want to be strict about this, put a similar check in
fill_stat_cache_info (used by entry.c) and cache_tree_invalidate_path, and
convert the code base (maybe except unpack-trees.c, just put a check
at the beginning of unpack_trees()) to use a wrapper function to change
ce_flags (some callers still update ce_flags directly, grep for them). Some
flags are in-memory only and should be allowed to change in a partial index;
some are written to disk and should not be allowed.
-- 
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 08/19] grep.c: Use index api
  2013-07-12 17:26 ` [PATCH v2 08/19] grep.c: Use index api Thomas Gummerer
@ 2013-07-14  3:32   ` Duy Nguyen
  2013-07-15  9:51     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  3:32 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +static int grep_cache(struct cache_entry *ce, void *cb_data)
>  {
> -       int hit = 0;
> -       int nr;
> -       read_cache();
> +       struct grep_opts *opts = cb_data;
>
> -       for (nr = 0; nr < active_nr; nr++) {
> -               struct cache_entry *ce = active_cache[nr];
> -               if (!S_ISREG(ce->ce_mode))
> -                       continue;
> -               if (!match_pathspec_depth(pathspec, ce->name, ce_namelen(ce), 0, NULL))
> -                       continue;
> -               /*
> -                * If CE_VALID is on, we assume worktree file and its cache entry
> -                * are identical, even if worktree file has been modified, so use
> -                * cache version instead
> -                */
> -               if (cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce)) {
> -                       if (ce_stage(ce))
> -                               continue;
> -                       hit |= grep_sha1(opt, ce->sha1, ce->name, 0, ce->name);
> -               }
> -               else
> -                       hit |= grep_file(opt, ce->name);
> -               if (ce_stage(ce)) {
> -                       do {
> -                               nr++;
> -                       } while (nr < active_nr &&
> -                                !strcmp(ce->name, active_cache[nr]->name));
> -                       nr--; /* compensate for loop control */
> -               }
> -               if (hit && opt->status_only)
> -                       break;
> -       }
> -       return hit;
> +       if (!S_ISREG(ce->ce_mode))
> +               return 0;
> +       if (!match_pathspec_depth(opts->pathspec, ce->name, ce_namelen(ce), 0, NULL))
> +               return 0;

You do a match_pathspec_depth here..

> @@ -895,10 +887,21 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>         } else if (0 <= opt_exclude) {
>                 die(_("--[no-]exclude-standard cannot be used for tracked contents."));
>         } else if (!list.nr) {
> +               struct grep_opts opts;
> +               struct filter_opts *filter_opts = xmalloc(sizeof(*filter_opts));
> +
>                 if (!cached)
>                         setup_work_tree();
>
> -               hit = grep_cache(&opt, &pathspec, cached);
> +               memset(filter_opts, 0, sizeof(*filter_opts));
> +               filter_opts->pathspec = &pathspec;
> +               opts.opt = &opt;
> +               opts.pathspec = &pathspec;
> +               opts.cached = cached;
> +               opts.hit = 0;
> +               read_cache_filtered(filter_opts);
> +               for_each_cache_entry(grep_cache, &opts);

And here again inside for_each_cache_entry. In the worst case that
could turn into two expensive fnmatch calls instead of one. Is this
conversion worth it? Note that match_pathspec is just a deprecated
version of match_pathspec_depth. They basically do the same thing.
-- 
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 09/19] ls-files.c: use index api
  2013-07-12 17:26 ` [PATCH v2 09/19] ls-files.c: use " Thomas Gummerer
@ 2013-07-14  3:39   ` Duy Nguyen
  2013-07-17  8:07     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  3:39 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +       if (!with_tree) {
> +               memset(opts, 0, sizeof(*opts));
> +               opts->pathspec = &pathspec_struct;
> +               opts->read_staged = 1;
> +               if (show_resolve_undo)
> +                       opts->read_resolve_undo = 1;
> +               read_cache_filtered(opts);

So you load partial index here.

> +       } else {
> +               read_cache();
> +       }
> +       /* be nice with submodule paths ending in a slash */
> +       if (pathspec)
> +               strip_trailing_slash_from_submodules();

Then strip_trailing_slash_from_submodules will attempt to convert
pathspec "foo/" to "foo" if "foo" exists in the index and is a
gitlink. But because "foo/" is used to load the partial index, "foo"
is not loaded (is it?) and this could become an incorrect no-op. I
suggest you go through the pathspec once, checking for ones ending with
'/'. If so, strip_trailing_... may potentially update the pathspec, so
just load the full index. If no pathspec ends with '/', strip_trail...
is a no-op and we can do partial loading safely.
-- 
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 10/19] documentation: add documentation of the index-v5 file format
  2013-07-12 17:26 ` [PATCH v2 10/19] documentation: add documentation of the index-v5 file format Thomas Gummerer
@ 2013-07-14  3:59   ` Duy Nguyen
  2013-07-17  8:09     ` Thomas Gummerer
  2013-08-04 11:26   ` Duy Nguyen
  1 sibling, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  3:59 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +== Directory offsets (diroffsets)
> +
> +  diroffset (32-bits): offset to the directory relative to the beginning
> +    of the index file. There are ndir + 1 offsets in the diroffset table,
> +    the last is pointing to the end of the last direntry. With this last
> +    entry, we are able to replace the strlen of when reading the directory

strlen of what?

> +    name, by calculating it from diroffset[n+1]-diroffset[n]-61.  61 is the
> +    size of the directory data, which follows each directory + the
> +    crc sum + the NUL byte.
> +
> +  This part is needed for making the directory entries bisectable and
> +    thus allowing a binary search.

Just thinking out loud. Maybe this section and fileoffsets should be
made optional extensions. So far I see no use for them. It's nice for
a program to occasionally look at a single entry, but such programs do
not exist (yet). For inotify monitor that may want to update a single
file's stat, it could regenerate the index with {dir,file}offsets
extensions the first time it attempts to update the index, then it
could do bsearch.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-07-12 17:26 ` [PATCH v2 12/19] read-cache: read index-v5 Thomas Gummerer
@ 2013-07-14  4:42   ` Duy Nguyen
  2013-08-07  8:13     ` Thomas Gummerer
  2013-07-15 10:12   ` Duy Nguyen
  1 sibling, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-14  4:42 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +struct directory_entry {
> +       struct directory_entry *next;
> +       struct directory_entry *next_hash;
> +       struct cache_entry *ce;
> +       struct cache_entry *ce_last;
> +       struct conflict_entry *conflict;
> +       struct conflict_entry *conflict_last;
> +       unsigned int conflict_size;
> +       unsigned int de_foffset;
> +       unsigned int de_cr;
> +       unsigned int de_ncr;
> +       unsigned int de_nsubtrees;
> +       unsigned int de_nfiles;
> +       unsigned int de_nentries;
> +       unsigned char sha1[20];
> +       unsigned short de_flags;
> +       unsigned int de_pathlen;
> +       char pathname[FLEX_ARRAY];
> +};
> +
> +struct conflict_part {
> +       struct conflict_part *next;
> +       unsigned short flags;
> +       unsigned short entry_mode;
> +       unsigned char sha1[20];
> +};
> +
> +struct conflict_entry {
> +       struct conflict_entry *next;
> +       unsigned int nfileconflicts;
> +       struct conflict_part *entries;
> +       unsigned int namelen;
> +       unsigned int pathlen;
> +       char name[FLEX_ARRAY];
> +};
> +
> +struct ondisk_conflict_part {
> +       unsigned short flags;
> +       unsigned short entry_mode;
> +       unsigned char sha1[20];
> +};

These new structs should probably be in read-cache-v5.c, or read-cache.h

>  #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
> +#define directory_entry_size(len) (offsetof(struct directory_entry,pathname) + (len) + 1)
> +#define conflict_entry_size(len) (offsetof(struct conflict_entry,name) + (len) + 1)

These are used by read-cache-v5.c only so far. I'd say move them to
read-cache.h or read-cache-v5.c together with the new structs.

> +struct ondisk_cache_entry {
> +       unsigned short flags;
> +       unsigned short mode;
> +       struct cache_time mtime;
> +       unsigned int size;
> +       int stat_crc;
> +       unsigned char sha1[20];
> +};
> +
> +struct ondisk_directory_entry {
> +       unsigned int foffset;
> +       unsigned int cr;
> +       unsigned int ncr;
> +       unsigned int nsubtrees;
> +       unsigned int nfiles;
> +       unsigned int nentries;
> +       unsigned char sha1[20];
> +       unsigned short flags;
> +};

Perhaps use uint32_t, uint16_t and friends for all on-disk structures?

> +static struct directory_entry *read_directories(unsigned int *dir_offset,
> +                               unsigned int *dir_table_offset,
> +                               void *mmap,
> +                               int mmap_size)
> +{
> +       int i, ondisk_directory_size;
> +       uint32_t *filecrc, *beginning, *end;
> +       struct directory_entry *current = NULL;
> +       struct ondisk_directory_entry *disk_de;
> +       struct directory_entry *de;
> +       unsigned int data_len, len;
> +       char *name;
> +
> +       /*
> +        * Length of pathname + nul byte for termination + size of
> +        * members of ondisk_directory_entry. (Just using the size
> +        * of the struct doesn't work, because there may be padding
> +        * bytes for the struct)
> +        */
> +       ondisk_directory_size = sizeof(disk_de->flags)
> +               + sizeof(disk_de->foffset)
> +               + sizeof(disk_de->cr)
> +               + sizeof(disk_de->ncr)
> +               + sizeof(disk_de->nsubtrees)
> +               + sizeof(disk_de->nfiles)
> +               + sizeof(disk_de->nentries)
> +               + sizeof(disk_de->sha1);
> +       name = ptr_add(mmap, *dir_offset);
> +       beginning = ptr_add(mmap, *dir_table_offset);
> +       end = ptr_add(mmap, *dir_table_offset + 4);
> +       len = ntoh_l(*end) - ntoh_l(*beginning) - ondisk_directory_size - 5;
> +       disk_de = ptr_add(mmap, *dir_offset + len + 1);
> +       de = directory_entry_from_ondisk(disk_de, name, len);
> +       de->next = NULL;
> +
> +       data_len = len + 1 + ondisk_directory_size;
> +       filecrc = ptr_add(mmap, *dir_offset + data_len);
> +       if (!check_crc32(0, ptr_add(mmap, *dir_offset), data_len, ntoh_l(*filecrc)))
> +               goto unmap;
> +
> +       *dir_table_offset += 4;
> +       *dir_offset += data_len + 4; /* crc code */
> +
> +       current = de;
> +       for (i = 0; i < de->de_nsubtrees; i++) {
> +               current->next = read_directories(dir_offset, dir_table_offset,
> +                                               mmap, mmap_size);
> +               while (current->next)
> +                       current = current->next;
> +       }
> +
> +       return de;
> +unmap:
> +       munmap(mmap, mmap_size);
> +       die("directory crc doesn't match for '%s'", de->pathname);
> +}

You don't have to munmap when you die() anyway. I'm not sure if
flattening the directory hierarchy into a list (linked by next pointer)
is a good idea, or whether we should maintain the tree structure in
memory. Still a lot of reading to do to figure that out..

I skipped from here..

> +static void ce_queue_push(struct cache_entry **head,
> +                            struct cache_entry **tail,
> +                            struct cache_entry *ce)
> +{

...

> +static int read_conflicts(struct conflict_entry **head,
> +                         struct directory_entry *de,
> +                         void **mmap, unsigned long mmap_size)
> +{

till the end of this function. Not interested in conflict stuff yet.


> +static struct directory_entry *read_head_directories(struct index_state *istate,
> +                                                    unsigned int *entry_offset,
> +                                                    unsigned int *foffsetblock,
> +                                                    unsigned int *ndirs,
> +                                                    void *mmap, unsigned long mmap_size)
> +{

Maybe read_all_directories is a better name.

> +static int read_index_filtered_v5(struct index_state *istate, void *mmap,
> +                                 unsigned long mmap_size, struct filter_opts *opts)
> +{
> +       unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
> +       struct directory_entry *root_directory, *de;
> +       const char **adjusted_pathspec = NULL;
> +       int need_root = 0, i, n;
> +       char *oldpath, *seen;
> +
> +       ...
> +
> +       de = root_directory;
> +       while (de) {
> +               if (need_root ||
> +                   match_pathspec(adjusted_pathspec, de->pathname, de->de_pathlen, 0, NULL)) {
> +                       unsigned int subdir_foffsetblock = de->de_foffset + foffsetblock;
> +                       unsigned int *off = mmap + subdir_foffsetblock;
> +                       unsigned int subdir_entry_offset = entry_offset + ntoh_l(*off);
> +                       oldpath = de->pathname;
> +                       do {
> +                               if (read_entries(istate, &de, &subdir_entry_offset,
> +                                                &mmap, mmap_size, &nr,
> +                                                &subdir_foffsetblock) < 0)
> +                                       return -1;
> +                       } while (de && !prefixcmp(de->pathname, oldpath));
> +               } else
> +                       de = de->next;
> +       }

Hm.. if we maintain a tree structure here (one link to the first
subdir, one link to the next sibling), I think the "do" loop could be
done without prefixcmp. Just check if "de" returned by read_entries is
the next sibling "de" (iow the end of current directory recursively).

> +       istate->cache_nr = nr;
> +       return 0;
> +}
> +
> +static int read_index_v5(struct index_state *istate, void *mmap,
> +                        unsigned long mmap_size, struct filter_opts *opts)
> +{
> +       unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
> +       struct directory_entry *root_directory, *de;
> +
> +       if (opts != NULL)
> +               return read_index_filtered_v5(istate, mmap, mmap_size, opts);
> +
> +       root_directory = read_head_directories(istate, &entry_offset,
> +                                              &foffsetblock, &ndirs,
> +                                              mmap, mmap_size);
> +       de = root_directory;
> +       while (de)
> +               if (read_entries(istate, &de, &entry_offset, &mmap,
> +                                mmap_size, &nr, &foffsetblock) < 0)
> +                       return -1;
> +       istate->cache_nr = nr;
> +       return 0;
> +}

Make it call read_index_filtered_v5 with an empty pathspec instead.
match_pathspec* returns true immediately if the pathspec is empty, so
even without the removal of prefixcmp() in the "do" loop mentioned
above, read_index_filtered_v5 can't be more expensive than this version.

That was it! Lunch time! Maybe I'll read the rest in the afternoon, or
someday next week.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 00/19] Index-v5
  2013-07-14  2:59 ` [PATCH v2 00/19] Index-v5 Duy Nguyen
@ 2013-07-15  9:30   ` Thomas Gummerer
  2013-07-15  9:38     ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-15  9:30 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>  t/perf/p0003-index.sh                            |   59 +
>>  t/t2104-update-index-skip-worktree.sh            |    1 +
>
> For such a big code addition, the test part seems modest. Speaking
> from my experience, I rarely run perf tests and "make test" does not
> activate v5 code at all. A few more tests would be nice. The good news
> is I changed default index version to 5 and ran "make test", all
> passed.

I was using the test suite with index version 5 as the default index
version for testing the new index file format.  I think that's the best
way to test the index, as it covers all aspects.  Maybe we should add a
test that covers the basic functionality, just to make sure nothing
obvious is broken when running the test suite with index-v2 as the
default?  Something like this maybe:

--->8---

>From c476b521c94f1a9b0b4fcfe92d63321442d79c9a Mon Sep 17 00:00:00 2001
From: Thomas Gummerer <t.gummerer@gmail.com>
Date: Mon, 15 Jul 2013 11:21:06 +0200
Subject: [PATCH] t1600: add basic test for index-v5

Add a test that checks the index-v5 file format when running the
test-suite with index-v2 as default index format.  When making changes
to the index, the test suite still should be run with both index v2 and
index v5 as default index format, for better coverage of all aspects of
the index.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 t/t1600-index-v5.sh | 133 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100755 t/t1600-index-v5.sh

diff --git a/t/t1600-index-v5.sh b/t/t1600-index-v5.sh
new file mode 100755
index 0000000..528c17e
--- /dev/null
+++ b/t/t1600-index-v5.sh
@@ -0,0 +1,133 @@
+#!/bin/sh
+#
+# Copyright (c) 2013 Thomas Gummerer
+
+test_description='Test basic functionality of index-v5.
+
+This test just covers the basics, to make sure normal runs of the test
+suite cover this version of the index file format too.  For better
+testing of the index-v5 format, the default index version should be
+changed to 5 and the test suite should be re-run'
+
+. ./test-lib.sh
+
+check_resolve_undo () {
+	msg=$1
+	shift
+	while case $# in
+	0)	break ;;
+	1|2|3)	die "Bug in check-resolve-undo test" ;;
+	esac
+	do
+		path=$1
+		shift
+		for stage in 1 2 3
+		do
+			sha1=$1
+			shift
+			case "$sha1" in
+			'') continue ;;
+			esac
+			sha1=$(git rev-parse --verify "$sha1")
+			printf "100644 %s %s\t%s\n" $sha1 $stage $path
+		done
+	done >"$msg.expect" &&
+	git ls-files --resolve-undo >"$msg.actual" &&
+	test_cmp "$msg.expect" "$msg.actual"
+}
+
+prime_resolve_undo () {
+	git reset --hard &&
+	git checkout second^0 &&
+	test_tick &&
+	test_must_fail git merge third^0 &&
+	echo merge does not leave anything &&
+	check_resolve_undo empty &&
+	echo different >fi/le &&
+	git add fi/le &&
+	echo resolving records &&
+	check_resolve_undo recorded fi/le initial:fi/le second:fi/le third:fi/le
+}
+
+test_expect_success 'setup' '
+	git update-index --index-version=5 &&
+	echo file1 >file1 &&
+	echo file2 >file2 &&
+	mkdir -p top/sub &&
+	echo x >top/x &&
+	echo xy >top/xy &&
+	echo y >top/y &&
+	echo yx >top/yx &&
+	echo sub1 >top/sub/sub1 &&
+	git add . &&
+	git commit -m "initial import"
+'
+
+test_expect_success 'ls-files shows all files' '
+	cat >expected <<-EOF &&
+	100644 e2129701f1a4d54dc44f03c93bca0a2aec7c5449 0	file1
+	100644 6c493ff740f9380390d5c9ddef4af18697ac9375 0	file2
+	100644 48df0cb83fee5d667537343f60a6057a63dd3c9b 0	top/sub/sub1
+	100644 587be6b4c3f93f93c489c0111bba5596147a26cb 0	top/x
+	100644 5aad9376af82d7b98a34f95fb0f298a162f52656 0	top/xy
+	100644 975fbec8256d3e8a3797e7a3611380f27c49f4ac 0	top/y
+	100644 ba1575927fa5b1f4bce72ad0c349566f1b02508e 0	top/yx
+	EOF
+	git ls-files --stage >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'ls-files with pathspec in subdir' '
+	cd top/sub &&
+	cat >expected <<-EOF &&
+	../x
+	../xy
+	EOF
+	git ls-files --error-unmatch ../x* >actual &&
+	test_cmp expected actual &&
+	cd ../..
+'
+
+test_expect_success 'read-tree HEAD establishes cache-tree' '
+	git read-tree HEAD &&
+	cat >expected <<-EOF &&
+	84e73410ea7864ccada24d897462e8ce1e1b872b  (7 entries, 1 subtrees)
+	602482536bd3852e8ac2977ed1a9913a8c244aa0 top/ (5 entries, 1 subtrees)
+	20bb0109200f37a7e19283b4abc4a672be3f0adb top/sub/ (1 entries, 0 subtrees)
+	EOF
+	test-dump-cache-tree >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success 'setup resolve-undo' '
+	mkdir fi &&
+	printf "a\0a" >binary &&
+	git add binary &&
+	test_commit initial fi/le first &&
+	git branch side &&
+	git branch another &&
+	printf "a\0b" >binary &&
+	git add binary &&
+	test_commit second fi/le second &&
+	git checkout side &&
+	test_commit third fi/le third &&
+	git branch add-add &&
+	git checkout another &&
+	test_commit fourth fi/le fourth &&
+	git checkout add-add &&
+	test_commit fifth add-differently &&
+	git checkout master
+'
+
+test_expect_success 'add records switch clears' '
+	prime_resolve_undo &&
+	test_tick &&
+	git commit -m merged &&
+	echo committing keeps &&
+	check_resolve_undo kept fi/le initial:fi/le second:fi/le third:fi/le &&
+	git checkout second^0 &&
+	echo switching clears &&
+	check_resolve_undo cleared
+'
+
+test_done
--
1.8.3.453.g1dfc63d

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 00/19] Index-v5
  2013-07-15  9:30   ` Thomas Gummerer
@ 2013-07-15  9:38     ` Duy Nguyen
  2013-07-17  8:12       ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-15  9:38 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Mon, Jul 15, 2013 at 4:30 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
>> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>>  t/perf/p0003-index.sh                            |   59 +
>>>  t/t2104-update-index-skip-worktree.sh            |    1 +
>>
>> For such a big code addition, the test part seems modest. Speaking
>> from my experience, I rarely run perf tests and "make test" does not
>> activate v5 code at all. A few more tests would be nice. The good news
>> is I changed default index version to 5 and ran "make test", all
>> passed.
>
> I was using the test suite with index version 5 as the default index
> version for testing the new index file format.  I think that's the
> best way to test the index, as it covers all aspects.

We need other people to run the test suite with v5 to catch bugs that
only appear on other platforms. Perhaps an environment variable to
allow the test suite to override the binary's default index version,
and maybe a new make target "test-v5".
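
Not sure what that should look like yet; as a rough sketch (the
GIT_TEST_INDEX_VERSION name and the "test-v5" target are made up here,
not existing knobs):

```shell
# Hypothetical knob: a "make test-v5" target would export this before
# running the suite, e.g. "$(MAKE) test GIT_TEST_INDEX_VERSION=5".
GIT_TEST_INDEX_VERSION=5

# test-lib.sh could then forward it as the default index version:
if test -n "$GIT_TEST_INDEX_VERSION"
then
	GIT_INDEX_VERSION=$GIT_TEST_INDEX_VERSION
	export GIT_INDEX_VERSION
fi
echo "index version for this test run: $GIT_INDEX_VERSION"
```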

> Maybe we should add a test
> that covers the basic functionality, just to make sure nothing obvious
> is broken when running the test suite with index-v2?  Something like
> this maybe:

Yep. v5-specific test cases can cover some corner cases that the rest
of the test suite does not care about. Haven't read your t1600 yet, though.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 08/19] grep.c: Use index api
  2013-07-14  3:32   ` Duy Nguyen
@ 2013-07-15  9:51     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-15  9:51 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +static int grep_cache(struct cache_entry *ce, void *cb_data)
>>  {
>> -       int hit = 0;
>> -       int nr;
>> -       read_cache();
>> +       struct grep_opts *opts = cb_data;
>>
>> -       for (nr = 0; nr < active_nr; nr++) {
>> -               struct cache_entry *ce = active_cache[nr];
>> -               if (!S_ISREG(ce->ce_mode))
>> -                       continue;
>> -               if (!match_pathspec_depth(pathspec, ce->name, ce_namelen(ce), 0, NULL))
>> -                       continue;
>> -               /*
>> -                * If CE_VALID is on, we assume worktree file and its cache entry
>> -                * are identical, even if worktree file has been modified, so use
>> -                * cache version instead
>> -                */
>> -               if (cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce)) {
>> -                       if (ce_stage(ce))
>> -                               continue;
>> -                       hit |= grep_sha1(opt, ce->sha1, ce->name, 0, ce->name);
>> -               }
>> -               else
>> -                       hit |= grep_file(opt, ce->name);
>> -               if (ce_stage(ce)) {
>> -                       do {
>> -                               nr++;
>> -                       } while (nr < active_nr &&
>> -                                !strcmp(ce->name, active_cache[nr]->name));
>> -                       nr--; /* compensate for loop control */
>> -               }
>> -               if (hit && opt->status_only)
>> -                       break;
>> -       }
>> -       return hit;
>> +       if (!S_ISREG(ce->ce_mode))
>> +               return 0;
>> +       if (!match_pathspec_depth(opts->pathspec, ce->name, ce_namelen(ce), 0, NULL))
>> +               return 0;
>
> You do a match_pathspec_depth here..
>
>> @@ -895,10 +887,21 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>>         } else if (0 <= opt_exclude) {
>>                 die(_("--[no-]exclude-standard cannot be used for tracked contents."));
>>         } else if (!list.nr) {
>> +               struct grep_opts opts;
>> +               struct filter_opts *filter_opts = xmalloc(sizeof(*filter_opts));
>> +
>>                 if (!cached)
>>                         setup_work_tree();
>>
>> -               hit = grep_cache(&opt, &pathspec, cached);
>> +               memset(filter_opts, 0, sizeof(*filter_opts));
>> +               filter_opts->pathspec = &pathspec;
>> +               opts.opt = &opt;
>> +               opts.pathspec = &pathspec;
>> +               opts.cached = cached;
>> +               opts.hit = 0;
>> +               read_cache_filtered(filter_opts);
>> +               for_each_cache_entry(grep_cache, &opts);
>
> And here again inside for_each_cache_entry. In the worst case that
> could turn into 2 expensive fnmatch instead of one. Is this conversion
> worth it? Note that match_pathspec is just a deprecated version of
> match_pathspec_depth. They basically do the same thing.

Right, the match_pathspec_depth call in builtin/grep.c should be
removed; it's unnecessary when using for_each_index_entry.  Thanks for
spotting it.  Other than that I still think the change makes sense.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-07-12 17:26 ` [PATCH v2 12/19] read-cache: read index-v5 Thomas Gummerer
  2013-07-14  4:42   ` Duy Nguyen
@ 2013-07-15 10:12   ` Duy Nguyen
  2013-07-17  8:11     ` Thomas Gummerer
  2013-08-07  8:23     ` Thomas Gummerer
  1 sibling, 2 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-07-15 10:12 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

A little bit more..

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +static void ce_queue_push(struct cache_entry **head,
> +                            struct cache_entry **tail,
> +                            struct cache_entry *ce)
> +{
> +       if (!*head) {
> +               *head = *tail = ce;
> +               (*tail)->next = NULL;
> +               return;
> +       }
> +
> +       (*tail)->next = ce;
> +       ce->next = NULL;
> +       *tail = (*tail)->next;

No no no. "next" is for name-hash.c; don't "reuse" it here. And I don't
think you really need to maintain a linked list of cache_entries, by
the way. More in the read_entries comments below..

> +}
> +
> +static struct cache_entry *ce_queue_pop(struct cache_entry **head)
> +{
> +       struct cache_entry *ce;
> +
> +       ce = *head;
> +       *head = (*head)->next;
> +       return ce;
> +}

Same as ce_queue_push.

> +static int read_entries(struct index_state *istate, struct directory_entry **de,
> +                       unsigned int *entry_offset, void **mmap,
> +                       unsigned long mmap_size, unsigned int *nr,
> +                       unsigned int *foffsetblock)
> +{
> +       struct cache_entry *head = NULL, *tail = NULL;
> +       struct conflict_entry *conflict_queue;
> +       struct cache_entry *ce;
> +       int i;
> +
> +       conflict_queue = NULL;
> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
> +               return -1;
> +       for (i = 0; i < (*de)->de_nfiles; i++) {
> +               if (read_entry(&ce,
> +                              *de,
> +                              entry_offset,
> +                              mmap,
> +                              mmap_size,
> +                              foffsetblock) < 0)
> +                       return -1;
> +               ce_queue_push(&head, &tail, ce);
> +               *foffsetblock += 4;
> +
> +               /*
> +                * Add the conflicted entries at the end of the index file
> +                * to the in memory format
> +                */
> +               if (conflict_queue &&
> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
> +                                       ce->name, ce_namelen(ce))) {
> +                       struct conflict_part *cp;
> +                       cp = conflict_queue->entries;
> +                       cp = cp->next;
> +                       while (cp) {
> +                               ce = convert_conflict_part(cp,
> +                                               conflict_queue->name,
> +                                               conflict_queue->namelen);
> +                               ce_queue_push(&head, &tail, ce);
> +                               conflict_part_head_remove(&cp);
> +                       }
> +                       conflict_entry_head_remove(&conflict_queue);
> +               }

I'm starting to wonder if separating staged entries is a good idea. It
seems to make the code more complicated. The good point about the
conflict section at the end of the file is that you can just truncate()
it out. Another way is putting staged entries in fileentries, sorted
alphabetically then by stage number, with a flag indicating whether the
entry is valid. When you resolve an entry, just set the flag to
invalid (a partial write), so that the read code will skip it.

I think this approach is reasonably cheap (unless there are a lot of
conflicts) and it simplifies this piece of code. truncate() may be
overrated anyway. In my experience, I "git add <path>" as soon as I
resolve <path> (so that "git diff" shrinks). One entry at a time, one
index write at a time. I don't think I ever resolve everything then
"git add -u .", which is where truncate() shines because staged
entries are removed all at once. We should optimize for one file
resolution at a time, imo.

> +       }
> +
> +       *de = (*de)->next;
> +
> +       while (head) {
> +               if (*de != NULL
> +                   && strcmp(head->name, (*de)->pathname) > 0) {
> +                       read_entries(istate,
> +                                    de,
> +                                    entry_offset,
> +                                    mmap,
> +                                    mmap_size,
> +                                    nr,
> +                                    foffsetblock);
> +               } else {
> +                       ce = ce_queue_pop(&head);
> +                       set_index_entry(istate, *nr, ce);
> +                       (*nr)++;
> +                       ce->next = NULL;
> +               }

In this loop, you do some sort of merge sort, combining a list of
sorted files and a list of sorted directories (which will be expanded
to bunches of files by the recursive read_entries). strcmp() does two
things. Assuming we're in directory 'a', it makes sure that subdirectory
'a/b' is only inserted after file 'a/a' is inserted, to maintain index
sort order. It also makes sure that a "de" of an outside directory 'b'
will not be expanded here. strcmp will be called for every file,
checking the _full_ path. Sounds expensive.

If you maintain two pointers, first_subdir and next_sibling, in "de" as
in my previous mail, you know that the "de" for 'b' is outside without
string comparison (de->next_sibling would be NULL). With that, the purpose of
strcmp is simply making sure that 'a/b' is inserted after 'a/a'.
Because we know all possible "de" is a subdirectory of 'a', we don't
need to compare the prefix 'a/'. We still need to strcmp on every
file, but we only need to compare the file name, not the leading
directory anymore. Cheaper.

Going further, I wonder if we could eliminate strcmp completely. We
could store dummy entries in fileentries to mark the positions of the
subdirectories. When we encounter a dummy entry, we call
read_entries() instead of set_index_entry.

If you merge the "read_entry" for loop into this while loop, you don't
need to maintain a linked list of ce's to pop out one by one here.

> +       }
> +       return 0;
> +}
> +
-- 
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 00/19] Index-v5
  2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
                   ` (19 preceding siblings ...)
  2013-07-14  2:59 ` [PATCH v2 00/19] Index-v5 Duy Nguyen
@ 2013-07-16 21:03 ` Ramsay Jones
  2013-07-17  8:04   ` Thomas Gummerer
  20 siblings, 1 reply; 51+ messages in thread
From: Ramsay Jones @ 2013-07-16 21:03 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: git, trast, mhagger, gitster, pclouds, robin.rosenberg, sunshine

Thomas Gummerer wrote:
> Hi,
> 
> previous rounds (without api) are at $gmane/202752, $gmane/202923,
> $gmane/203088 and $gmane/203517, the previous round with api was at
> $gmane/229732.  Thanks to Junio, Duy and Eric for their comments on
> the previous round.

If I remember correctly, the original version of this series had the
same problem as Michael's "Fix some reference-related races" series
(now in master). In particular, you had introduced an 'index_changed()'
function which does essentially the same job as 'stat_validity_check()'
in the new reference handling API. I seem to remember advising you
not to compare st_uid, st_gid and st_ino on __CYGWIN__.

I haven't had time to look at this version of your series yet, but it
may be worth taking a look at stat_validity_check(). (although that is
causing failures on cygwin at the moment! ;-)

Also, I can't recall if I mentioned it to you at the time, but your
index reading code was (unnecessarily) calling munmap() twice on the
same buffer (without an intervening mmap()). This causes problems for
systems that have the NO_MMAP build variable set. In particular, the
compat/mmap.c code will attempt to free() the allocated memory block
twice, with unpredictable results.

I wrote a patch to address this at the time (Hmm, seems to be built
on v1.8.1), but didn't submit it since your patch didn't progress. :-D
I have included the patch below.

ATB,
Ramsay Jones

-- >8 --
From: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Date: Sun, 9 Sep 2012 20:50:32 +0100
Subject: [PATCH] mmap.c: Keep log of mmap() blocks to avoid double-delete bug

When compiling with the NO_MMAP build variable set, the built-in
'git_mmap()' and 'git_munmap()' compatibility routines use simple
memory allocation and file I/O to emulate the required behaviour.
The current implementation is vulnerable to the "double-delete" bug
(where the pointer returned by malloc() is passed to free() two or
more times), should the mapped memory block address be passed to
munmap() multiple times.

In order to guard the implementation from such a calling sequence,
we keep a list of mmap-block descriptors, which we then consult to
determine the validity of the input pointer to munmap(). This then
allows 'git_munmap()' to return -1 on error, as required, with
errno set to EINVAL.

Using a list in the log of mmap-ed blocks, along with the resulting
linear search, means that the performance of the code is directly
proportional to the number of concurrently active memory mapped
file regions. The number of such regions is not expected to be
excessive.

Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
---
 compat/mmap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/compat/mmap.c b/compat/mmap.c
index c9d46d1..400e034 100644
--- a/compat/mmap.c
+++ b/compat/mmap.c
@@ -1,14 +1,61 @@
 #include "../git-compat-util.h"
 
+struct mmbd {  /* memory mapped block descriptor */
+	struct mmbd *next;  /* next in list */
+	void   *start;      /* pointer to memory mapped block */
+	size_t length;      /* length of memory mapped block */
+};
+
+static struct mmbd *head;  /* head of mmb descriptor list */
+
+
+static void add_desc(struct mmbd *desc, void *start, size_t length)
+{
+	desc->start = start;
+	desc->length = length;
+	desc->next = head;
+	head = desc;
+}
+
+static void free_desc(struct mmbd *desc)
+{
+	if (head == desc)
+		head = head->next;
+	else {
+		struct mmbd *d = head;
+		for (; d; d = d->next) {
+			if (d->next == desc) {
+				d->next = desc->next;
+				break;
+			}
+		}
+	}
+	free(desc);
+}
+
+static struct mmbd *find_desc(void *start)
+{
+	struct mmbd *d = head;
+	for (; d; d = d->next) {
+		if (d->start == start)
+			return d;
+	}
+	return NULL;
+}
+
 void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset)
 {
 	size_t n = 0;
+	struct mmbd *desc = NULL;
 
 	if (start != NULL || !(flags & MAP_PRIVATE))
 		die("Invalid usage of mmap when built with NO_MMAP");
 
 	start = xmalloc(length);
-	if (start == NULL) {
+	desc = xmalloc(sizeof(*desc));
+	if (!start || !desc) {
+		free(start);
+		free(desc);
 		errno = ENOMEM;
 		return MAP_FAILED;
 	}
@@ -25,18 +72,26 @@ void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t of
 			if (errno == EAGAIN || errno == EINTR)
 				continue;
 			free(start);
+			free(desc);
 			errno = EACCES;
 			return MAP_FAILED;
 		}
 
 		n += count;
 	}
+	add_desc(desc, start, length);
 
 	return start;
 }
 
 int git_munmap(void *start, size_t length)
 {
+	struct mmbd *d = find_desc(start);
+	if (!d) {
+		errno = EINVAL;
+		return -1;
+	}
+	free_desc(d);
 	free(start);
 	return 0;
 }
-- 
1.8.3

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 00/19] Index-v5
  2013-07-16 21:03 ` Ramsay Jones
@ 2013-07-17  8:04   ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-17  8:04 UTC (permalink / raw)
  To: Ramsay Jones
  Cc: git, trast, mhagger, gitster, pclouds, robin.rosenberg, sunshine

Ramsay Jones <ramsay@ramsay1.demon.co.uk> writes:

> Thomas Gummerer wrote:
>> Hi,
>>
>> previous rounds (without api) are at $gmane/202752, $gmane/202923,
>> $gmane/203088 and $gmane/203517, the previous round with api was at
>> $gmane/229732.  Thanks to Junio, Duy and Eric for their comments on
>> the previous round.
>
> If I remember correctly, the original version of this series had the
> same problem as Michael's "Fix some reference-related races" series
> (now in master). In particular, you had introduced an 'index_changed()'
> function which does essentially the same job as 'stat_validity_check()'
> in the new reference handling API. I seem to remember advising you
> not to compare st_uid, st_gid and st_ino on __CYGWIN__.

Yes, you provided a patch that simply wrapped those checks in a #if
!defined (__CYGWIN__), which is included in the new series too.


> I haven't had time to look at this version of your series yet, but it
> may be worth taking a look at stat_validity_check(). (although that is
> causing failures on cygwin at the moment! ;-)

I took a quick look; that function makes sense, I think.  I'll use it in
the re-roll.  It probably makes sense to wrap the uid, gid and ino
fields as in the index_changed function.


> Also, I can't recall if I mentioned it to you at the time, but your
> index reading code was (unnecessarily) calling munmap() twice on the
> same buffer (without an intervening mmap()). This causes problems for
> systems that have the NO_MMAP build variable set. In particular, the
> compat/mmap.c code will attempt to free() the allocated memory block
> twice, with unpredictable results.
>
> I wrote a patch to address this at the time (Hmm, seems to be built
> on v1.8.1), but didn't submit it since your patch didn't progress. :-D
> I have included the patch below.

I can't recall this either.  From a quick check I don't call munmap() on
an already unmapped buffer, so I think this is fine as it is and your
patch is independent from it.  Not sure if it makes sense as a safeguard
for future changes.

> -- >8 --
> From: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
> Date: Sun, 9 Sep 2012 20:50:32 +0100
> Subject: [PATCH] mmap.c: Keep log of mmap() blocks to avoid double-delete bug
>
> When compiling with the NO_MMAP build variable set, the built-in
> 'git_mmap()' and 'git_munmap()' compatibility routines use simple
> memory allocation and file I/O to emulate the required behaviour.
> The current implementation is vulnerable to the "double-delete" bug
> (where the pointer returned by malloc() is passed to free() two or
> more times), should the mapped memory block address be passed to
> munmap() multiple times.
>
> In order to guard the implementation from such a calling sequence,
> we keep a list of mmap-block descriptors, which we then consult to
> determine the validity of the input pointer to munmap(). This then
> allows 'git_munmap()' to return -1 on error, as required, with
> errno set to EINVAL.
>
> Using a list in the log of mmap-ed blocks, along with the resulting
> linear search, means that the performance of the code is directly
> proportional to the number of concurrently active memory mapped
> file regions. The number of such regions is not expected to be
> excessive.
>
> Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
> ---
>  compat/mmap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/compat/mmap.c b/compat/mmap.c
> index c9d46d1..400e034 100644
> --- a/compat/mmap.c
> +++ b/compat/mmap.c
> @@ -1,14 +1,61 @@
>  #include "../git-compat-util.h"
>
> +struct mmbd {  /* memory mapped block descriptor */
> +	struct mmbd *next;  /* next in list */
> +	void   *start;      /* pointer to memory mapped block */
> +	size_t length;      /* length of memory mapped block */
> +};
> +
> +static struct mmbd *head;  /* head of mmb descriptor list */
> +
> +
> +static void add_desc(struct mmbd *desc, void *start, size_t length)
> +{
> +	desc->start = start;
> +	desc->length = length;
> +	desc->next = head;
> +	head = desc;
> +}
> +
> +static void free_desc(struct mmbd *desc)
> +{
> +	if (head == desc)
> +		head = head->next;
> +	else {
> +		struct mmbd *d = head;
> +		for (; d; d = d->next) {
> +			if (d->next == desc) {
> +				d->next = desc->next;
> +				break;
> +			}
> +		}
> +	}
> +	free(desc);
> +}
> +
> +static struct mmbd *find_desc(void *start)
> +{
> +	struct mmbd *d = head;
> +	for (; d; d = d->next) {
> +		if (d->start == start)
> +			return d;
> +	}
> +	return NULL;
> +}
> +
>  void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset)
>  {
>  	size_t n = 0;
> +	struct mmbd *desc = NULL;
>
>  	if (start != NULL || !(flags & MAP_PRIVATE))
>  		die("Invalid usage of mmap when built with NO_MMAP");
>
>  	start = xmalloc(length);
> -	if (start == NULL) {
> +	desc = xmalloc(sizeof(*desc));
> +	if (!start || !desc) {
> +		free(start);
> +		free(desc);
>  		errno = ENOMEM;
>  		return MAP_FAILED;
>  	}
> @@ -25,18 +72,26 @@ void *git_mmap(void *start, size_t length, int prot, int flags, int fd, off_t of
>  			if (errno == EAGAIN || errno == EINTR)
>  				continue;
>  			free(start);
> +			free(desc);
>  			errno = EACCES;
>  			return MAP_FAILED;
>  		}
>
>  		n += count;
>  	}
> +	add_desc(desc, start, length);
>
>  	return start;
>  }
>
>  int git_munmap(void *start, size_t length)
>  {
> +	struct mmbd *d = find_desc(start);
> +	if (!d) {
> +		errno = EINVAL;
> +		return -1;
> +	}
> +	free_desc(d);
>  	free(start);
>  	return 0;
>  }
> --
> 1.8.3

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 09/19] ls-files.c: use index api
  2013-07-14  3:39   ` Duy Nguyen
@ 2013-07-17  8:07     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-17  8:07 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +       if (!with_tree) {
>> +               memset(opts, 0, sizeof(*opts));
>> +               opts->pathspec = &pathspec_struct;
>> +               opts->read_staged = 1;
>> +               if (show_resolve_undo)
>> +                       opts->read_resolve_undo = 1;
>> +               read_cache_filtered(opts);
>
> So you load partial index here.
>
>> +       } else {
>> +               read_cache();
>> +       }
>> +       /* be nice with submodule paths ending in a slash */
>> +       if (pathspec)
>> +               strip_trailing_slash_from_submodules();
>
> Then strip_trailing_slash_from_submodules will attempt to convert
> pathspec "foo/" to "foo" if "foo" exists in the index and is a
> gitlink. But because "foo/" is used to load the partial index, "foo"
> is not loaded (is it?) and this could become an incorrect no-op. I
> suggest you go through the pathspec once checking for ones ending with
> '/'. If so strip_trailing_... may potentially update pathspec, just
> load full index. If no pathspec ends with '/', strip_trail... is no-op
> and we can do partial loading safely.

It was loaded, because the adjusted_pathspec algorithm stripped the
trailing slash and then everything until the next slash.  That's
overkill except when the trailing slash had to be stripped.

I'll make the adjusted_pathspec algorithm more restrictive, so the last
trailing slash is no longer stripped.  If a pathspec contains a trailing
slash I'll load the whole index, as you suggested.


* Re: [PATCH v2 10/19] documentation: add documentation of the index-v5 file format
  2013-07-14  3:59   ` Duy Nguyen
@ 2013-07-17  8:09     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-17  8:09 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +== Directory offsets (diroffsets)
>> +
>> +  diroffset (32-bits): offset to the directory relative to the beginning
>> +    of the index file. There are ndir + 1 offsets in the diroffset table,
>> +    the last is pointing to the end of the last direntry. With this last
>> +    entry, we are able to replace the strlen of when reading the directory
>
> strlen of what?

Of the dirname.  Will fix in the re-roll, thanks for noticing.

>> +    name, by calculating it from diroffset[n+1]-diroffset[n]-61.  61 is the
>> +    size of the directory data, which follows each directory + the
>> +    crc sum + the NUL byte.
>> +
>> +  This part is needed for making the directory entries bisectable and
>> +    thus allowing a binary search.
>
> Just thinking out loud. Maybe this section and fileoffsets should be
> made optional extensions. So far I see no use for them. It's nice for
> a program to occasionally look at a single entry, but such programs do
> not exist (yet). For inotify monitor that may want to update a single
> file's stat, it could regenerate the index with {dir,file}offsets
> extensions the first time it attempts to update the index, then it
> could do bsearch.

There is a use already, namely saving the strlen for the filename.
Previously we had the length of the filename in the lower bits of the
flags, but decided to remove it to avoid having extended flags.  That,
in addition to the future use case, warrants keeping them in the index,
I think.


* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-07-15 10:12   ` Duy Nguyen
@ 2013-07-17  8:11     ` Thomas Gummerer
  2013-08-08  2:00       ` Duy Nguyen
  2013-08-07  8:23     ` Thomas Gummerer
  1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-17  8:11 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

[..]

>> +static int read_entries(struct index_state *istate, struct directory_entry **de,
>> +                       unsigned int *entry_offset, void **mmap,
>> +                       unsigned long mmap_size, unsigned int *nr,
>> +                       unsigned int *foffsetblock)
>> +{
>> +       struct cache_entry *head = NULL, *tail = NULL;
>> +       struct conflict_entry *conflict_queue;
>> +       struct cache_entry *ce;
>> +       int i;
>> +
>> +       conflict_queue = NULL;
>> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
>> +               return -1;
>> +       for (i = 0; i < (*de)->de_nfiles; i++) {
>> +               if (read_entry(&ce,
>> +                              *de,
>> +                              entry_offset,
>> +                              mmap,
>> +                              mmap_size,
>> +                              foffsetblock) < 0)
>> +                       return -1;
>> +               ce_queue_push(&head, &tail, ce);
>> +               *foffsetblock += 4;
>> +
>> +               /*
>> +                * Add the conflicted entries at the end of the index file
>> +                * to the in memory format
>> +                */
>> +               if (conflict_queue &&
>> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
>> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
>> +                                       ce->name, ce_namelen(ce))) {
>> +                       struct conflict_part *cp;
>> +                       cp = conflict_queue->entries;
>> +                       cp = cp->next;
>> +                       while (cp) {
>> +                               ce = convert_conflict_part(cp,
>> +                                               conflict_queue->name,
>> +                                               conflict_queue->namelen);
>> +                               ce_queue_push(&head, &tail, ce);
>> +                               conflict_part_head_remove(&cp);
>> +                       }
>> +                       conflict_entry_head_remove(&conflict_queue);
>> +               }
>
> I start to wonder if separating staged entries is a good idea. It
> seems to make the code more complicated. The good point about conflict
> section at the end of the file is you can just truncate() it out.
> Another way is putting staged entries in fileentries, sorted
> alphabetically then by stage number, and a flag indicating if the
> entry is valid. When you resolve an entry, just set the flag to
> invalid (a partial write), so that the read code will skip it.
>
> I think this approach is reasonably cheap (unless there are a lot of
> conflicts) and it simplifies this piece of code. truncate() may be
> overrated anyway. In my experience, I "git add <path>" as soon as I
> resolve <path> (so that "git diff" shrinks). One entry at a time, one
> index write at a time. I don't think I ever resolve everything then
> "git add -u .", which is where truncate() shines because staged
> entries are removed all at once. We should optimize for one file
> resolution at a time, imo.

Thanks for your comments.  I'll address the other ones once we've
decided what to do with the conflicts.

It does make the code quite a bit more complicated, but it also has one
advantage that you overlooked.  We wouldn't need to truncate() when
resolving the conflicts.  The resolve undo data is stored with the
conflicts, and therefore we could just flip a bit and set the stage of
the cache-entry in the main index to 0 (at least once we have partial
writing).  This can be fast both when we resolve one entry at a time and
when we resolve a lot of entries.  The advantage is even bigger when we
resolve one entry at a time, when we would otherwise have to re-write
the index for each conflict resolution.

[..]


* Re: [PATCH v2 00/19] Index-v5
  2013-07-15  9:38     ` Duy Nguyen
@ 2013-07-17  8:12       ` Thomas Gummerer
  2013-07-17 23:58         ` Junio C Hamano
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-17  8:12 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Mon, Jul 15, 2013 at 4:30 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Duy Nguyen <pclouds@gmail.com> writes:
>>
>>> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>>>  t/perf/p0003-index.sh                            |   59 +
>>>>  t/t2104-update-index-skip-worktree.sh            |    1 +
>>>
>>> For such a big code addition, the test part seems modest. Speaking
>>> from my experience, I rarely run perf tests and "make test" does not
>>> activate v5 code at all. A few more tests would be nice. The good news
>>> is I changed default index version to 5 and ran "make test", all
>>> passed.
>>
>> I was using the test suite with index version 5 as default index version
>> for testing of the new index file format.   I think that's the best way
>> to test the index, as it covers all aspects.
>
> We need other people to run the test suite with v5 to catch bugs that
> only appear on other platforms. Perhaps an env variable to allow the
> test suite to override the binary's default index version and a new
> make target "test-v5" maybe.

Ah ok, I understand.  I think it's best to add a GIT_INDEX_VERSION=x
config option to config.mak, where x is the index version that should be
tested.  This is also the way other test options are set
(e.g. NO_SVN_TESTS).  In addition this allows testing other versions of
the index file too.

>> Maybe we should add a test
>> that covers the basic functionality, just to make sure nothing obvious
>> is broken when running the test suite with index-v2?  Something like
>> this maybe:
>
> Yep. v5 specfic test cases can cover some corner cases that the rest
> of the test suite does not care. Haven't read your t1600 though.

Ok.  I can't think of any corner cases right now that would apply to v5
but not to other versions; if I catch any I'll add tests.


* Re: [PATCH v2 07/19] make sure partially read index is not changed
  2013-07-14  3:29   ` Duy Nguyen
@ 2013-07-17 12:56     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-17 12:56 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> A partially read index file currently cannot be written to disk.  Make
>> sure that never happens, by erroring out when a caller tries to change a
>> partially read index.  The caller is responsible for reading the whole
>> index when it's trying to change it later.
>>
>> Forcing the caller to load the right part of the index file instead of
>> re-reading it when changing it, gives a bit of a performance advantage,
>> by avoiding to read parts of the index twice.
>
> If you want to be strict about this, put similar check in
> fill_stat_cache_info (used by entry.c), cache_tree_invalidate_path and
> convert the code base (maybe except unpack-trees.c, just put a check
> in the beginning of unpack_trees()) to use wrapper function to change
> ce_flags (some still update ce_flags directly, grep them). Some flags
> are in memory only and should be allowed to change in partial index,
> some are to be written on disk and should not be allowed.

I'm not sure anymore that we should even be this strict.  A partially
read index will never make it to disk, because write_index checks that
it was fully read.  I think the checks in write_index and
read_index_filtered would be enough.

The only disadvantage would be that it takes a little longer to catch an
error that should never happen in the first place.  If it does occur,
the user has bigger problems than the extra time that catching the error
later adds to the execution of the command.

I also don't see a clean way to add the check to fill_stat_cache_info or
cache_tree_invalidate_path, because we would need an additional
parameter for the index_state, or at least for index_state->filter_opts,
which doesn't do anything but check if the index is only partially loaded.


* Re: [PATCH v2 00/19] Index-v5
  2013-07-17  8:12       ` Thomas Gummerer
@ 2013-07-17 23:58         ` Junio C Hamano
  2013-07-19 17:37           ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2013-07-17 23:58 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Duy Nguyen, Git Mailing List, Thomas Rast, Michael Haggerty,
	Robin Rosenberg, Eric Sunshine

Thomas Gummerer <t.gummerer@gmail.com> writes:

> Ah ok, I understand.  I think it's best to add a GIT_INDEX_VERSION=x
> config option to config.mak, where x is the index version that should be
> tested.

Whatever you do, please do not call it GIT_INDEX_VERSION _if_ it is
only to be used while testing.  Have string "TEST" somewhere in the
name, and make t/test-lib-functions.sh take notice.

Currently, the way for the user to show the preference as to which
index version to use is to explicitly set the version once, and then
we will (try to) propagate it inside the repository.

I would not mind if we add a mechanism to make write_index() notice
the environment variable GIT_INDEX_VERSION and write the index in
that version.  But that is conceptually very different from whatever
you give "make VARIABLE=blah" from the command line when building
Git (or set in config.mak which amounts to the same thing).


* Re: [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file
  2013-07-14  3:10   ` Duy Nguyen
@ 2013-07-19 14:53     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-19 14:53 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> @@ -489,8 +479,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig
>>  #define CE_MATCH_RACY_IS_DIRTY         02
>>  /* do stat comparison even if CE_SKIP_WORKTREE is true */
>>  #define CE_MATCH_IGNORE_SKIP_WORKTREE  04
>> -extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
>> -extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
>> +extern int ie_match_stat(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
>> +extern int ie_modified(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
>>
>
> I would rather we keep "const struct index_state*" if we could. I
> tried putting "const" back and found that ce_match_stat_basic() calls
> set_istate_ops(), which writes to "struct index_state". Putting
> set_istate_ops() in ce_match_stat_basic() may seem convenient, but
> does not make much sense (why would a match_stat function update
> index_ops?). I think you could move it out and
>
>  - read_index calls set_istate_ops
>  - (bonus) discard_index probably should reset "version" field to zero
> and clear internal_ops
>  - the callers that use index without read_index must call
> initialize_index() or something, which in turn calls set_istate_ops.
> initialize_index() may take the preferred index version
>  - do not let update-index modifies version field directly when
> --index-version is given. Wrap it with set_index_version() or
> something, so we can do internal conversion from one version to
> another if needed
>  - remove set_istate_ops in write_index(), we may need internal_ops
> long before writing. When write_index is called, internal_ops should
> be already initialized

Ok, this makes sense.  The only thing that I'm a little worried about is
catching all code paths that need to initialize the index.  I'll
implement these suggestions in the re-roll.


* Re: [PATCH v2 00/19] Index-v5
  2013-07-17 23:58         ` Junio C Hamano
@ 2013-07-19 17:37           ` Thomas Gummerer
  2013-07-19 18:25             ` Junio C Hamano
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-19 17:37 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Duy Nguyen, Git Mailing List, Thomas Rast, Michael Haggerty,
	Robin Rosenberg, Eric Sunshine

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> Ah ok, I understand.  I think it's best to add a GIT_INDEX_VERSION=x
>> config option to config.mak, where x is the index version that should be
>> tested.
>
> Whatever you do, please do not call it GIT_INDEX_VERSION _if_ it is
> only to be used while testing.  Have string "TEST" somewhere in the
> name, and make t/test-lib-functions.sh take notice.
>
> Currently, the way for the user to show the preference as to which
> index version to use is to explicitly set the version once, and then
> we will (try to) propagate it inside the repository.
>
> I would not mind if we add a mechanism to make write_index() notice
> the environment variable GIT_INDEX_VERSION and write the index in
> that version.  But that is conceptually very different from whatever
> you give "make VARIABLE=blah" from the command line when building
> Git (or set in config.mak which amounts to the same thing).

What I currently did is add an environment variable GIT_INDEX_VERSION
that is used only if there is no index yet, to make sure existing
repositories aren't affected and still have to be converted explicitly
by using git update-index.

For the tests I simply did export GIT_INDEX_VERSION in test-lib.sh to
allow the addition of GIT_INDEX_VERSION in config.mak.  Should I rename
that to GIT_INDEX_VERSION_TEST and do something like

set_index_version() {
        export GIT_INDEX_VERSION=$GIT_INDEX_VERSION_TEST
}

in test-lib-functions.sh instead, does that make sense?


* Re: [PATCH v2 00/19] Index-v5
  2013-07-19 17:37           ` Thomas Gummerer
@ 2013-07-19 18:25             ` Junio C Hamano
  0 siblings, 0 replies; 51+ messages in thread
From: Junio C Hamano @ 2013-07-19 18:25 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Duy Nguyen, Git Mailing List, Thomas Rast, Michael Haggerty,
	Robin Rosenberg, Eric Sunshine

Thomas Gummerer <t.gummerer@gmail.com> writes:

> What I currently did is add a environment variable GIT_INDEX_VERSION
> that is used only if there is no index yet, to make sure existing
> repositories aren't affected and still have to be converted explicitly
> by using git update-index.
>
> For the tests I simply did export GIT_INDEX_VERSION in test-lib.sh to
> allow the addition of GIT_INDEX_VERSION in config.mak.  Should I rename
> that to GIT_INDEX_VERSION_TEST and do something like
>
> set_index_version() {
>         export GIT_INDEX_VERSION=$GIT_INDEX_VERSION_TEST
> }
>
> in test-lib-functions.sh instead, does that make sense?

Mostly Yes ;-).  You have to write it in a valid POSIX shell, i.e.

	set_index_version () {
		GIT_INDEX_VERSION=$TEST_GIT_INDEX_VERSION
                export GIT_INDEX_VERSION
	}


* Re: [PATCH v2 10/19] documentation: add documentation of the index-v5 file format
  2013-07-12 17:26 ` [PATCH v2 10/19] documentation: add documentation of the index-v5 file format Thomas Gummerer
  2013-07-14  3:59   ` Duy Nguyen
@ 2013-08-04 11:26   ` Duy Nguyen
  2013-08-04 17:58     ` Thomas Gummerer
  1 sibling, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-08-04 11:26 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +== Header
> +   sig (32-bits): Signature:
> +     The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
> +
> +   vnr (32-bits): Version number:
> +     The current supported versions are 2, 3, 4 and 5.
> +
> +   ndir (32-bits): number of directories in the index.
> +
> +   nfile (32-bits): number of file entries in the index.

I just noticed that "file" command can show the number of entries,
something like this

.git/index: Git index, version 4, 2625 entries

Maybe we can swap ndir and nfile to imitate the pre-v5 format, so
unmodified "file" still shows the correct number of entries/files with v5?
-- 
Duy


* Re: [PATCH v2 10/19] documentation: add documentation of the index-v5 file format
  2013-08-04 11:26   ` Duy Nguyen
@ 2013-08-04 17:58     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-08-04 17:58 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +== Header
>> +   sig (32-bits): Signature:
>> +     The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
>> +
>> +   vnr (32-bits): Version number:
>> +     The current supported versions are 2, 3, 4 and 5.
>> +
>> +   ndir (32-bits): number of directories in the index.
>> +
>> +   nfile (32-bits): number of file entries in the index.
>
> I just noticed that "file" command can show the number of entries,
> something like this
>
> .git/index: Git index, version 4, 2625 entries
>
> Maybe we can swap ndir and nfile to imitate the pre-v5 format, so
> unmodified "file" still shows the correct number of entries/files with v5?

Cool, I didn't know that worked.  In that case it makes sense to
imitate the pre-v5 format.  Will change this in the re-roll.

Just a heads up about my progress, I'm currently trying out your
suggestions about storing the directories as tree-format when reading
and storing the dummy entries in the index.  I'll make some experiments
to see if the change is worth it, and will then send a re-roll, probably
in a couple of weeks.


* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-07-14  4:42   ` Duy Nguyen
@ 2013-08-07  8:13     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-08-07  8:13 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +struct directory_entry {
>> +       struct directory_entry *next;
>> +       struct directory_entry *next_hash;
>> +       struct cache_entry *ce;
>> +       struct cache_entry *ce_last;
>> +       struct conflict_entry *conflict;
>> +       struct conflict_entry *conflict_last;
>> +       unsigned int conflict_size;
>> +       unsigned int de_foffset;
>> +       unsigned int de_cr;
>> +       unsigned int de_ncr;
>> +       unsigned int de_nsubtrees;
>> +       unsigned int de_nfiles;
>> +       unsigned int de_nentries;
>> +       unsigned char sha1[20];
>> +       unsigned short de_flags;
>> +       unsigned int de_pathlen;
>> +       char pathname[FLEX_ARRAY];
>> +};
>> +
>> +struct conflict_part {
>> +       struct conflict_part *next;
>> +       unsigned short flags;
>> +       unsigned short entry_mode;
>> +       unsigned char sha1[20];
>> +};
>> +
>> +struct conflict_entry {
>> +       struct conflict_entry *next;
>> +       unsigned int nfileconflicts;
>> +       struct conflict_part *entries;
>> +       unsigned int namelen;
>> +       unsigned int pathlen;
>> +       char name[FLEX_ARRAY];
>> +};
>> +
>> +struct ondisk_conflict_part {
>> +       unsigned short flags;
>> +       unsigned short entry_mode;
>> +       unsigned char sha1[20];
>> +};
>
> These new structs should probably be in read-cache-v5.c, or read-cache.h

Makes sense, thanks.

>>  #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
>> +#define directory_entry_size(len) (offsetof(struct directory_entry,pathname) + (len) + 1)
>> +#define conflict_entry_size(len) (offsetof(struct conflict_entry,name) + (len) + 1)
>
> These are used by read-cache-v5.c only so far. I'd say move them to
> read-cache.h or read-cache-v5.c together with the new structs.

Thanks.

>> +struct ondisk_cache_entry {
>> +       unsigned short flags;
>> +       unsigned short mode;
>> +       struct cache_time mtime;
>> +       unsigned int size;
>> +       int stat_crc;
>> +       unsigned char sha1[20];
>> +};
>> +
>> +struct ondisk_directory_entry {
>> +       unsigned int foffset;
>> +       unsigned int cr;
>> +       unsigned int ncr;
>> +       unsigned int nsubtrees;
>> +       unsigned int nfiles;
>> +       unsigned int nentries;
>> +       unsigned char sha1[20];
>> +       unsigned short flags;
>> +};
>
> Perhaps use uint32_t, uint16_t and friends for all on-disk structures?

We have this in the Makefile, so I think we should be fine without it.
It still makes sense for clarity, though.

         ifdef NO_UINTMAX_T
	       BASIC_CFLAGS += -Duintmax_t=uint32_t
         endif

While at it I'll make the code for v[234] use them too.

>> +static struct directory_entry *read_directories(unsigned int *dir_offset,
>> +                               unsigned int *dir_table_offset,
>> +                               void *mmap,
>> +                               int mmap_size)
>> +{
>> +       int i, ondisk_directory_size;
>> +       uint32_t *filecrc, *beginning, *end;
>> +       struct directory_entry *current = NULL;
>> +       struct ondisk_directory_entry *disk_de;
>> +       struct directory_entry *de;
>> +       unsigned int data_len, len;
>> +       char *name;
>> +
>> +       /*
>> +        * Length of pathname + nul byte for termination + size of
>> +        * members of ondisk_directory_entry. (Just using the size
>> +        * of the struct doesn't work, because there may be padding
>> +        * bytes for the struct)
>> +        */
>> +       ondisk_directory_size = sizeof(disk_de->flags)
>> +               + sizeof(disk_de->foffset)
>> +               + sizeof(disk_de->cr)
>> +               + sizeof(disk_de->ncr)
>> +               + sizeof(disk_de->nsubtrees)
>> +               + sizeof(disk_de->nfiles)
>> +               + sizeof(disk_de->nentries)
>> +               + sizeof(disk_de->sha1);
>> +       name = ptr_add(mmap, *dir_offset);
>> +       beginning = ptr_add(mmap, *dir_table_offset);
>> +       end = ptr_add(mmap, *dir_table_offset + 4);
>> +       len = ntoh_l(*end) - ntoh_l(*beginning) - ondisk_directory_size - 5;
>> +       disk_de = ptr_add(mmap, *dir_offset + len + 1);
>> +       de = directory_entry_from_ondisk(disk_de, name, len);
>> +       de->next = NULL;
>> +
>> +       data_len = len + 1 + ondisk_directory_size;
>> +       filecrc = ptr_add(mmap, *dir_offset + data_len);
>> +       if (!check_crc32(0, ptr_add(mmap, *dir_offset), data_len, ntoh_l(*filecrc)))
>> +               goto unmap;
>> +
>> +       *dir_table_offset += 4;
>> +       *dir_offset += data_len + 4; /* crc code */
>> +
>> +       current = de;
>> +       for (i = 0; i < de->de_nsubtrees; i++) {
>> +               current->next = read_directories(dir_offset, dir_table_offset,
>> +                                               mmap, mmap_size);
>> +               while (current->next)
>> +                       current = current->next;
>> +       }
>> +
>> +       return de;
>> +unmap:
>> +       munmap(mmap, mmap_size);
>> +       die("directory crc doesn't match for '%s'", de->pathname);
>> +}
>
> You don't have to munmap when you die() anway.

Will change that in the re-roll.

> I'm not sure if flattening
> the directory hierarchy into a list (linked by next pointer) is a good
> idea, or we should maintain the tree structure in memory. Still a lot
> of reading to figure that out..
>
> I skipped from here..
>
>> +static void ce_queue_push(struct cache_entry **head,
>> +                            struct cache_entry **tail,
>> +                            struct cache_entry *ce)
>> +{
>
> ...
>
>> +static int read_conflicts(struct conflict_entry **head,
>> +                         struct directory_entry *de,
>> +                         void **mmap, unsigned long mmap_size)
>> +{
>
> till the end of this function. Not interested in conflict stuff yet.
>
>
>> +static struct directory_entry *read_head_directories(struct index_state *istate,
>> +                                                    unsigned int *entry_offset,
>> +                                                    unsigned int *foffsetblock,
>> +                                                    unsigned int *ndirs,
>> +                                                    void *mmap, unsigned long mmap_size)
>> +{
>
> Maybe read_all_directories is a better name.

Makes sense, thanks.

>> +static int read_index_filtered_v5(struct index_state *istate, void *mmap,
>> +                                 unsigned long mmap_size, struct filter_opts *opts)
>> +{
>> +       unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
>> +       struct directory_entry *root_directory, *de;
>> +       const char **adjusted_pathspec = NULL;
>> +       int need_root = 0, i, n;
>> +       char *oldpath, *seen;
>> +
>> +       ...
>> +
>> +       de = root_directory;
>> +       while (de) {
>> +               if (need_root ||
>> +                   match_pathspec(adjusted_pathspec, de->pathname, de->de_pathlen, 0, NULL)) {
>> +                       unsigned int subdir_foffsetblock = de->de_foffset + foffsetblock;
>> +                       unsigned int *off = mmap + subdir_foffsetblock;
>> +                       unsigned int subdir_entry_offset = entry_offset + ntoh_l(*off);
>> +                       oldpath = de->pathname;
>> +                       do {
>> +                               if (read_entries(istate, &de, &subdir_entry_offset,
>> +                                                &mmap, mmap_size, &nr,
>> +                                                &subdir_foffsetblock) < 0)
>> +                                       return -1;
>> +                       } while (de && !prefixcmp(de->pathname, oldpath));
>> +               } else
>> +                       de = de->next;
>> +       }
>
> Hm.. if we maintain a tree structure here (one link to the first
> subdir, one link to the next sibling), I think the "do" loop could be
> done without prefixcmp. Just check if "de" returned by read_entries is
> the next sibling "de" (iow the end of current directory recursively).

Yes, the tree-structure makes sense.  I've implemented it a bit
differently though, instead of using two pointers, I'm using one pointer
to an array of directory entries, which can be iterated over.

>> +       istate->cache_nr = nr;
>> +       return 0;
>> +}
>> +
>> +static int read_index_v5(struct index_state *istate, void *mmap,
>> +                        unsigned long mmap_size, struct filter_opts *opts)
>> +{
>> +       unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
>> +       struct directory_entry *root_directory, *de;
>> +
>> +       if (opts != NULL)
>> +               return read_index_filtered_v5(istate, mmap, mmap_size, opts);
>> +
>> +       root_directory = read_head_directories(istate, &entry_offset,
>> +                                              &foffsetblock, &ndirs,
>> +                                              mmap, mmap_size);
>> +       de = root_directory;
>> +       while (de)
>> +               if (read_entries(istate, &de, &entry_offset, &mmap,
>> +                                mmap_size, &nr, &foffsetblock) < 0)
>> +                       return -1;
>> +       istate->cache_nr = nr;
>> +       return 0;
>> +}
>
> Make it call read_index_filtered_v5 with an empty pathspec instead.
> match_pathspec* returns true immediately if pathspec is empty. Without
> the removal of prefixcmp() in the "do" loop mentioned above,
> read_index_filtered_v5 can't be more expensive than this version.

Yes right, will change in the re-roll.

> That was it! Lunch time! Maybe I'll read the rest in the afternoon, or
> someday next week.

Thanks a lot for taking the time to review my code.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-07-15 10:12   ` Duy Nguyen
  2013-07-17  8:11     ` Thomas Gummerer
@ 2013-08-07  8:23     ` Thomas Gummerer
  2013-08-08  2:09       ` Duy Nguyen
  1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-08-07  8:23 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> A little bit more..
>
> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +static void ce_queue_push(struct cache_entry **head,
>> +                            struct cache_entry **tail,
>> +                            struct cache_entry *ce)
>> +{
>> +       if (!*head) {
>> +               *head = *tail = ce;
>> +               (*tail)->next = NULL;
>> +               return;
>> +       }
>> +
>> +       (*tail)->next = ce;
>> +       ce->next = NULL;
>> +       *tail = (*tail)->next;
>
> No no no. "next" is for name-hash.c; don't "reuse" it here. And I don't
> think you really need to maintain a linked list of cache_entries by
> the way. More on read_entries comments below..

You're right, I don't need it for the reading code.  I do need to keep a
list of cache-entries for writing though, where a linked list is best
suited for the task.  I'll use a next_ce pointer for that.

>> +}
>> +
>> +static struct cache_entry *ce_queue_pop(struct cache_entry **head)
>> +{
>> +       struct cache_entry *ce;
>> +
>> +       ce = *head;
>> +       *head = (*head)->next;
>> +       return ce;
>> +}
>
> Same as ce_queue_push.
>
>> +static int read_entries(struct index_state *istate, struct directory_entry **de,
>> +                       unsigned int *entry_offset, void **mmap,
>> +                       unsigned long mmap_size, unsigned int *nr,
>> +                       unsigned int *foffsetblock)
>> +{
>> +       struct cache_entry *head = NULL, *tail = NULL;
>> +       struct conflict_entry *conflict_queue;
>> +       struct cache_entry *ce;
>> +       int i;
>> +
>> +       conflict_queue = NULL;
>> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
>> +               return -1;
>> +       for (i = 0; i < (*de)->de_nfiles; i++) {
>> +               if (read_entry(&ce,
>> +                              *de,
>> +                              entry_offset,
>> +                              mmap,
>> +                              mmap_size,
>> +                              foffsetblock) < 0)
>> +                       return -1;
>> +               ce_queue_push(&head, &tail, ce);
>> +               *foffsetblock += 4;
>> +
>> +               /*
>> +                * Add the conflicted entries at the end of the index file
>> +                * to the in memory format
>> +                */
>> +               if (conflict_queue &&
>> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
>> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
>> +                                       ce->name, ce_namelen(ce))) {
>> +                       struct conflict_part *cp;
>> +                       cp = conflict_queue->entries;
>> +                       cp = cp->next;
>> +                       while (cp) {
>> +                               ce = convert_conflict_part(cp,
>> +                                               conflict_queue->name,
>> +                                               conflict_queue->namelen);
>> +                               ce_queue_push(&head, &tail, ce);
>> +                               conflict_part_head_remove(&cp);
>> +                       }
>> +                       conflict_entry_head_remove(&conflict_queue);
>> +               }
>
> I start to wonder if separating staged entries is a good idea. It
> seems to make the code more complicated. The good point about conflict
> section at the end of the file is you can just truncate() it out.
> Another way is putting staged entries in fileentries, sorted
> alphabetically then by stage number, and a flag indicating if the
> entry is valid. When you resolve an entry, just set the flag to
> invalid (partial write), so that read code will skip it.
>
> I think this approach is reasonably cheap (unless there are a lot of
> conflicts) and it simplifies this piece of code. truncate() may be
> overrated anyway. In my experience, I "git add <path>" as soon as I
> resolve <path> (so that "git diff" shrinks). One entry at a time, one
> index write at a time. I don't think I ever resolve everything then
> "git add -u .", which is where truncate() shines because staged
> entries are removed all at once. We should optimize for one file
> resolution at a time, imo.
>
>> +       }
>> +
>> +       *de = (*de)->next;
>> +
>> +       while (head) {
>> +               if (*de != NULL
>> +                   && strcmp(head->name, (*de)->pathname) > 0) {
>> +                       read_entries(istate,
>> +                                    de,
>> +                                    entry_offset,
>> +                                    mmap,
>> +                                    mmap_size,
>> +                                    nr,
>> +                                    foffsetblock);
>> +               } else {
>> +                       ce = ce_queue_pop(&head);
>> +                       set_index_entry(istate, *nr, ce);
>> +                       (*nr)++;
>> +                       ce->next = NULL;
>> +               }
>
> In this loop, you do some sort of merge sort combining a list of
> sorted files and a list of sorted directories (which will be expanded
> to bunches of files by the recursive read_entries). strcmp() does two
> things. Assume we're in directory 'a', it makes sure that subdirectory
> 'a/b' is only inserted after file 'a/a' is inserted to maintain index
> sort order. It also makes sure that 'de' of an outside directory 'b'
> will not be expanded here. strcmp will be called for every file,
> checking _full_ path. Sounds expensive.
>
> If you maintain two pointers first_subdir and next_sibling in "de" as
> in my previous mail, you know "de" 'b' is out without string
> comparison (de->next_sibling would be NULL). With that, the purpose of
> strcmp is simply making sure that 'a/b' is inserted after 'a/a'.
> Because we know all possible "de" is a subdirectory of 'a', we don't
> need to compare the prefix 'a/'. We still need to strcmp on every
> file, but we only need to compare the file name, not the leading
> directory anymore. Cheaper.
>
> Going further, I wonder if we could eliminate strcmp completely. We
> could store dummy entries in fileentries to mark the positions of the
> subdirectories. When we encounter a dummy entry, we call
> read_entries() instead of set_index_entry.

I've tried implementing both versions with the internal tree structure,
see below for timings.

simpleapi is the code that was posted to the list, HEAD uses a
tree-structure for the directories internally, and replaces strcmp with
cache_name_compare and leaves out the prefix.  This tree uses dummy
entries for the sub-directories.

Dummy entries are single NUL-bytes mixed with the cache-entries, which
should give optimal performance.  I haven't thought much about
corruption of the index, but I don't think we would need any additional
crc checksums or similar.

The performance advantage seems very slim to none when using the dummy
entries to avoid the comparison though, so I don't think it makes sense
to make the index more complicated for a very small speed-up.

Test                                        simpleapi         HEAD                     this tree             
-------------------------------------------------------------------------------------------------------------
0003.2: v[23]: update-index                 3.22(2.61+0.58)   3.20(2.56+0.61) -0.6%    3.22(2.67+0.53) +0.0% 
0003.3: v[23]: grep nonexistent -- subdir   1.65(1.36+0.27)   1.65(1.34+0.30) +0.0%    1.67(1.34+0.31) +1.2% 
0003.4: v[23]: ls-files -- subdir           1.50(1.20+0.29)   1.54(1.18+0.34) +2.7%    1.53(1.22+0.30) +2.0% 
0003.6: v4: update-index                    2.69(2.28+0.39)   2.70(2.21+0.47) +0.4%    2.70(2.27+0.41) +0.4% 
0003.7: v4: grep nonexistent -- subdir      1.33(0.98+0.34)   1.36(1.01+0.33) +2.3%    1.34(1.03+0.30) +0.8% 
0003.8: v4: ls-files -- subdir              1.21(0.86+0.33)   1.23(0.91+0.30) +1.7%    1.22(0.90+0.30) +0.8% 
0003.10: v5: update-index                   2.12(1.58+0.52)   2.20(1.67+0.52) +3.8%    2.19(1.66+0.51) +3.3% 
0003.11: v5: ls-files                       1.57(1.21+0.34)   1.55(1.20+0.33) -1.3%    1.52(1.21+0.29) -3.2% 
0003.12: v5: grep nonexistent -- subdir     0.08(0.06+0.01)   0.07(0.04+0.02) -12.5%   0.07(0.04+0.02) -12.5%
0003.13: v5: ls-files -- subdir             0.07(0.04+0.01)   0.06(0.05+0.00) -14.3%   0.07(0.05+0.01) +0.0% 


> If you merge the "read_entry" for loop in this while loop, you don't
> need to maintain a linked list of ce to pop out one by one here.

Makes sense, I'll do that.

>> +       }
>> +       return 0;
>> +}
>> +
> --
> Duy


* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-07-17  8:11     ` Thomas Gummerer
@ 2013-08-08  2:00       ` Duy Nguyen
  2013-08-08 13:28         ` Thomas Gummerer
  2013-08-09 13:10         ` Thomas Gummerer
  0 siblings, 2 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-08-08  2:00 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Wed, Jul 17, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
> [..]
>
>>> +static int read_entries(struct index_state *istate, struct directory_entry **de,
>>> +                       unsigned int *entry_offset, void **mmap,
>>> +                       unsigned long mmap_size, unsigned int *nr,
>>> +                       unsigned int *foffsetblock)
>>> +{
>>> +       struct cache_entry *head = NULL, *tail = NULL;
>>> +       struct conflict_entry *conflict_queue;
>>> +       struct cache_entry *ce;
>>> +       int i;
>>> +
>>> +       conflict_queue = NULL;
>>> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
>>> +               return -1;
>>> +       for (i = 0; i < (*de)->de_nfiles; i++) {
>>> +               if (read_entry(&ce,
>>> +                              *de,
>>> +                              entry_offset,
>>> +                              mmap,
>>> +                              mmap_size,
>>> +                              foffsetblock) < 0)
>>> +                       return -1;
>>> +               ce_queue_push(&head, &tail, ce);
>>> +               *foffsetblock += 4;
>>> +
>>> +               /*
>>> +                * Add the conflicted entries at the end of the index file
>>> +                * to the in memory format
>>> +                */
>>> +               if (conflict_queue &&
>>> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
>>> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
>>> +                                       ce->name, ce_namelen(ce))) {
>>> +                       struct conflict_part *cp;
>>> +                       cp = conflict_queue->entries;
>>> +                       cp = cp->next;
>>> +                       while (cp) {
>>> +                               ce = convert_conflict_part(cp,
>>> +                                               conflict_queue->name,
>>> +                                               conflict_queue->namelen);
>>> +                               ce_queue_push(&head, &tail, ce);
>>> +                               conflict_part_head_remove(&cp);
>>> +                       }
>>> +                       conflict_entry_head_remove(&conflict_queue);
>>> +               }
>>
>> I start to wonder if separating staged entries is a good idea. It
>> seems to make the code more complicated. The good point about conflict
>> section at the end of the file is you can just truncate() it out.
>> Another way is putting staged entries in fileentries, sorted
>> alphabetically then by stage number, and a flag indicating if the
>> entry is valid. When you resolve an entry, just set the flag to
>> invalid (partial write), so that read code will skip it.
>>
>> I think this approach is reasonably cheap (unless there are a lot of
>> conflicts) and it simplifies this piece of code. truncate() may be
>> overrated anyway. In my experience, I "git add <path>" as soon as I
>> resolve <path> (so that "git diff" shrinks). One entry at a time, one
>> index write at a time. I don't think I ever resolve everything then
>> "git add -u .", which is where truncate() shines because staged
>> entries are removed all at once. We should optimize for one file
>> resolution at a time, imo.
>
> Thanks for your comments.  I'll address the other ones once we've
> decided what to do with the conflicts.
>
> It does make the code quite a bit more complicated, but also has one
> advantage that you overlooked.

I did overlook that, although my goal is to keep the code simpler, not
more complicated. The thinking is that if we can find everything in the
fileentries table, the code here is simplified, so..

> We wouldn't truncate() when resolving
> the conflicts.  The resolve undo data is stored with the conflicts and
> therefore we could just flip a bit and set the stage of the cache-entry
> in the main index to 0 (always once we got partial writing).  This can
> be fast both in case we resolve one entry at a time and when we resolve
> a lot of entries.  The advantage is even bigger when we resolve one
> entry at a time, when we otherwise would have to re-write the index for
> each conflict resolution.

If I understand it correctly, fileentries can only contain stage-0 or
stage-1 entries, "stage > 0" entries are stored in conflict data. Once
a conflict is solved, you update the stage-1 entry in fileentries,
turning it to stage-0 and recalculate the entry checksum. Conflict
data remains there to function as the old REUC extension. Correct?

First of all, if that's true, we only need 1 bit for stage in fileentries table.

Secondly, you may get away with looking up to conflict data in this
function by storing all stages in fileentries (now we need 2-bit
stage), replicated in conflict data for reuc function. When you
resolve conflict, you flip stage-1 to stage-0, and flip (a new bit) to
mark stage-2 entry invalid so the code knows to skip it. Next time the
index is rewritten, invalid entries are removed, but we still have old
stage entries in conflict data. The flipping business is pretty much
what you plan anyway, but the reading code does not need to look at
both fileentries and conflict data at the same time.

What do you think?
-- 
Duy


* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-08-07  8:23     ` Thomas Gummerer
@ 2013-08-08  2:09       ` Duy Nguyen
  0 siblings, 0 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-08-08  2:09 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

On Wed, Aug 7, 2013 at 3:23 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> On Sat, Jul 13, 2013 at 12:26 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> +static void ce_queue_push(struct cache_entry **head,
>>> +                            struct cache_entry **tail,
>>> +                            struct cache_entry *ce)
>>> +{
>>> +       if (!*head) {
>>> +               *head = *tail = ce;
>>> +               (*tail)->next = NULL;
>>> +               return;
>>> +       }
>>> +
>>> +       (*tail)->next = ce;
>>> +       ce->next = NULL;
>>> +       *tail = (*tail)->next;
>>
>> No no no. "next" is for name-hash.c; don't "reuse" it here. And I don't
>> think you really need to maintain a linked list of cache_entries by
>> the way. More on read_entries comments below..
>
> You're right, I don't need it for the reading code.  I do need to keep a
> list of cache-entries for writing though, where a linked list is best
> suited for the task.  I'll use a next_ce pointer for that.

Hmm.. yeah you need to maintain temporary structures before writing
out direntries, then fileentries table. We can't just write as we
traverse through the cache in one go. Maybe a new field in cache_entry
for the linked list, or maintain the linked list off cache_entry..
whichever is easier for you.

> I've tried implementing both versions with the internal tree structure,
> see below for timings.
>
> simpleapi is the code that was posted to the list, HEAD uses a
> tree-structure for the directories internally, and replaces strcmp with
> cache_name_compare and leaves out the prefix.  This tree uses dummy
> entries for the sub-directories.
>
> Dummy entries are single NUL-bytes mixed with the cache-entries, which
> should give optimal performance.  I haven't thought much about
> corruption of the index, but I don't think we would need any additional
> crc checksums or similar.
>
> The performance advantage seems very slim to none when using the dummy
> entries to avoid the comparison though, so I don't think it makes sense
> to make the index more complicated for a very small speed-up.

Agreed.

> Test                                        simpleapi         HEAD                     this tree
> -------------------------------------------------------------------------------------------------------------
> 0003.2: v[23]: update-index                 3.22(2.61+0.58)   3.20(2.56+0.61) -0.6%    3.22(2.67+0.53) +0.0%
> 0003.3: v[23]: grep nonexistent -- subdir   1.65(1.36+0.27)   1.65(1.34+0.30) +0.0%    1.67(1.34+0.31) +1.2%
> 0003.4: v[23]: ls-files -- subdir           1.50(1.20+0.29)   1.54(1.18+0.34) +2.7%    1.53(1.22+0.30) +2.0%
> 0003.6: v4: update-index                    2.69(2.28+0.39)   2.70(2.21+0.47) +0.4%    2.70(2.27+0.41) +0.4%
> 0003.7: v4: grep nonexistent -- subdir      1.33(0.98+0.34)   1.36(1.01+0.33) +2.3%    1.34(1.03+0.30) +0.8%
> 0003.8: v4: ls-files -- subdir              1.21(0.86+0.33)   1.23(0.91+0.30) +1.7%    1.22(0.90+0.30) +0.8%
> 0003.10: v5: update-index                   2.12(1.58+0.52)   2.20(1.67+0.52) +3.8%    2.19(1.66+0.51) +3.3%
> 0003.11: v5: ls-files                       1.57(1.21+0.34)   1.55(1.20+0.33) -1.3%    1.52(1.21+0.29) -3.2%
> 0003.12: v5: grep nonexistent -- subdir     0.08(0.06+0.01)   0.07(0.04+0.02) -12.5%   0.07(0.04+0.02) -12.5%
> 0003.13: v5: ls-files -- subdir             0.07(0.04+0.01)   0.06(0.05+0.00) -14.3%   0.07(0.05+0.01) +0.0%
-- 
Duy


* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-08-08  2:00       ` Duy Nguyen
@ 2013-08-08 13:28         ` Thomas Gummerer
  2013-08-09 13:10         ` Thomas Gummerer
  1 sibling, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-08-08 13:28 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Wed, Jul 17, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Duy Nguyen <pclouds@gmail.com> writes:
>>
>> [..]
>>
>>>> +static int read_entries(struct index_state *istate, struct directory_entry **de,
>>>> +                       unsigned int *entry_offset, void **mmap,
>>>> +                       unsigned long mmap_size, unsigned int *nr,
>>>> +                       unsigned int *foffsetblock)
>>>> +{
>>>> +       struct cache_entry *head = NULL, *tail = NULL;
>>>> +       struct conflict_entry *conflict_queue;
>>>> +       struct cache_entry *ce;
>>>> +       int i;
>>>> +
>>>> +       conflict_queue = NULL;
>>>> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
>>>> +               return -1;
>>>> +       for (i = 0; i < (*de)->de_nfiles; i++) {
>>>> +               if (read_entry(&ce,
>>>> +                              *de,
>>>> +                              entry_offset,
>>>> +                              mmap,
>>>> +                              mmap_size,
>>>> +                              foffsetblock) < 0)
>>>> +                       return -1;
>>>> +               ce_queue_push(&head, &tail, ce);
>>>> +               *foffsetblock += 4;
>>>> +
>>>> +               /*
>>>> +                * Add the conflicted entries at the end of the index file
>>>> +                * to the in memory format
>>>> +                */
>>>> +               if (conflict_queue &&
>>>> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
>>>> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
>>>> +                                       ce->name, ce_namelen(ce))) {
>>>> +                       struct conflict_part *cp;
>>>> +                       cp = conflict_queue->entries;
>>>> +                       cp = cp->next;
>>>> +                       while (cp) {
>>>> +                               ce = convert_conflict_part(cp,
>>>> +                                               conflict_queue->name,
>>>> +                                               conflict_queue->namelen);
>>>> +                               ce_queue_push(&head, &tail, ce);
>>>> +                               conflict_part_head_remove(&cp);
>>>> +                       }
>>>> +                       conflict_entry_head_remove(&conflict_queue);
>>>> +               }
>>>
>>> I start to wonder if separating staged entries is a good idea. It
>>> seems to make the code more complicated. The good point about conflict
>>> section at the end of the file is you can just truncate() it out.
>>> Another way is putting staged entries in fileentries, sorted
>>> alphabetically then by stage number, and a flag indicating if the
>>> entry is valid. When you resolve an entry, just set the flag to
>>> invalid (partial write), so that read code will skip it.
>>>
>>> I think this approach is reasonably cheap (unless there are a lot of
>>> conflicts) and it simplifies this piece of code. truncate() may be
>>> overrated anyway. In my experience, I "git add <path>" as soon as I
>>> resolve <path> (so that "git diff" shrinks). One entry at a time, one
>>> index write at a time. I don't think I ever resolve everything then
>>> "git add -u .", which is where truncate() shines because staged
>>> entries are removed all at once. We should optimize for one file
>>> resolution at a time, imo.
>>
>> Thanks for your comments.  I'll address the other ones once we've
>> decided what to do with the conflicts.
>>
>> It does make the code quite a bit more complicated, but also has one
>> advantage that you overlooked.
>
> I did overlook that, although my goal is to keep the code simpler, not
> more complicated. The thinking is that if we can find everything in the
> fileentries table, the code here is simplified, so..
>
>> We wouldn't truncate() when resolving
>> the conflicts.  The resolve undo data is stored with the conflicts and
>> therefore we could just flip a bit and set the stage of the cache-entry
>> in the main index to 0 (always once we got partial writing).  This can
>> be fast both in case we resolve one entry at a time and when we resolve
>> a lot of entries.  The advantage is even bigger when we resolve one
>> entry at a time, when we otherwise would have to re-write the index for
>> each conflict resolution.
>
> If I understand it correctly, fileentries can only contain stage-0 or
> stage-1 entries, "stage > 0" entries are stored in conflict data. Once
> a conflict is solved, you update the stage-1 entry in fileentries,
> turning it to stage-0 and recalculate the entry checksum. Conflict
> data remains there to function as the old REUC extension. Correct?

Correct.

> First of all, if that's true, we only need 1 bit for stage in fileentries table.
>
> Secondly, you may get away with looking up to conflict data in this
> function by storing all stages in fileentries (now we need 2-bit
> stage), replicated in conflict data for reuc function. When you
> resolve conflict, you flip stage-1 to stage-0, and flip (a new bit) to
> mark stage-2 entry invalid so the code knows to skip it. Next time the
> index is rewritten, invalid entries are removed, but we still have old
> stage entries in conflict data. The flipping business is pretty much
> what you plan anyway, but the reading code does not need to look at
> both fileentries and conflict data at the same time.
>
> What do you think?

I think that's a good idea, thanks.  It may slow the index down a bit
when there are a lot of conflicts, but I think once we get to the point
that it slows the index down noticeably, the user has much bigger
problems anyway and will probably not try to resolve them all.

The invalid entry flag could also be used when entries are only removed
from the index, so we can just set that bit instead of re-writing the
whole index file.

I'll try to implement this idea to see if the experiments match the
expectations and then send a re-roll.


* Re: [PATCH v2 12/19] read-cache: read index-v5
  2013-08-08  2:00       ` Duy Nguyen
  2013-08-08 13:28         ` Thomas Gummerer
@ 2013-08-09 13:10         ` Thomas Gummerer
  1 sibling, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-08-09 13:10 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg, Eric Sunshine

Duy Nguyen <pclouds@gmail.com> writes:

> On Wed, Jul 17, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Duy Nguyen <pclouds@gmail.com> writes:
>>
>> [..]
>>
>>>> +static int read_entries(struct index_state *istate, struct directory_entry **de,
>>>> +                       unsigned int *entry_offset, void **mmap,
>>>> +                       unsigned long mmap_size, unsigned int *nr,
>>>> +                       unsigned int *foffsetblock)
>>>> +{
>>>> +       struct cache_entry *head = NULL, *tail = NULL;
>>>> +       struct conflict_entry *conflict_queue;
>>>> +       struct cache_entry *ce;
>>>> +       int i;
>>>> +
>>>> +       conflict_queue = NULL;
>>>> +       if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
>>>> +               return -1;
>>>> +       for (i = 0; i < (*de)->de_nfiles; i++) {
>>>> +               if (read_entry(&ce,
>>>> +                              *de,
>>>> +                              entry_offset,
>>>> +                              mmap,
>>>> +                              mmap_size,
>>>> +                              foffsetblock) < 0)
>>>> +                       return -1;
>>>> +               ce_queue_push(&head, &tail, ce);
>>>> +               *foffsetblock += 4;
>>>> +
>>>> +               /*
>>>> +                * Add the conflicted entries at the end of the index file
>>>> +                * to the in memory format
>>>> +                */
>>>> +               if (conflict_queue &&
>>>> +                   (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
>>>> +                   !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
>>>> +                                       ce->name, ce_namelen(ce))) {
>>>> +                       struct conflict_part *cp;
>>>> +                       cp = conflict_queue->entries;
>>>> +                       cp = cp->next;
>>>> +                       while (cp) {
>>>> +                               ce = convert_conflict_part(cp,
>>>> +                                               conflict_queue->name,
>>>> +                                               conflict_queue->namelen);
>>>> +                               ce_queue_push(&head, &tail, ce);
>>>> +                               conflict_part_head_remove(&cp);
>>>> +                       }
>>>> +                       conflict_entry_head_remove(&conflict_queue);
>>>> +               }
>>>
>>> I start to wonder if separating staged entries is a good idea. It
>>> seems to make the code more complicated. The good point about conflict
>>> section at the end of the file is you can just truncate() it out.
>>> Another way is putting staged entries in fileentries, sorted
>>> alphabetically then by stage number, and a flag indicating if the
>>> entry is valid. When you resolve an entry, just set the flag to
>>> invalid (partial write), so that the read code will skip it.
>>>
>>> I think this approach is reasonably cheap (unless there are a lot of
>>> conflicts) and it simplifies this piece of code. truncate() may be
>>> overrated anyway. In my experience, I "git add <path>" as soon as I
>>> resolve <path> (so that "git diff" shrinks). One entry at a time, one
>>> index write at a time. I don't think I ever resolve everything then
>>> "git add -u .", which is where truncate() shines because staged
>>> entries are removed all at once. We should optimize for one file
>>> resolution at a time, imo.
>>
>> Thanks for your comments.  I'll address the other ones once we've
>> decided what to do with the conflicts.
>>
>> It does make the code quite a bit more complicated, but also has one
>> advantage that you overlooked.
>
> I did overlook, although my goal is to keep the code simpler, not more
> complicated. The thinking is that if we can find everything in the
> fileentries table, the code here is simplified, so..
>
>> We wouldn't truncate() when resolving
>> the conflicts.  The resolve undo data is stored with the conflicts and
>> therefore we could just flip a bit and set the stage of the cache-entry
>> in the main index to 0 (at least once we have partial writing).  This can
>> be fast both in case we resolve one entry at a time and when we resolve
>> a lot of entries.  The advantage is even bigger when we resolve one
>> entry at a time, when we otherwise would have to re-write the index for
>> each conflict resolution.
>
> If I understand it correctly, fileentries can only contain stage-0 or
> stage-1 entries, "stage > 0" entries are stored in conflict data. Once
> a conflict is solved, you update the stage-1 entry in fileentries,
> turning it to stage-0 and recalculate the entry checksum. Conflict
> data remains there to function as the old REUC extension. Correct?
>
> First of all, if that's true, we only need 1 bit for the stage in the
> fileentries table.
>
> Secondly, you may get away with looking up to conflict data in this
> function by storing all stages in fileentries (now we need 2-bit
> stage), replicated in conflict data for reuc function. When you
> resolve conflict, you flip stage-1 to stage-0, and flip (a new bit) to
> mark stage-2 entry invalid so the code knows to skip it. Next time the
> index is rewritten, invalid entries are removed, but we still have old
> stage entries in conflict data. The flipping business is pretty much
> what you plan anyway, but the reading code does not need to look at
> both fileentries and conflict data at the same time.
>
> What do you think?

I've now tried it out for my synthetic repository, and created ~115,000
conflicts in it.  It's only a little slower even for that large number
of conflicts (which I don't think will ever happen in practice), but the
code is definitely simpler, so I will go with this.  The times for the
old and new version are below.

Test                                        HEAD~1            this tree
-----------------------------------------------------------------------------------
0003.2: v[23]: update-index                 3.50(2.87+0.61)   3.52(2.89+0.60) +0.6%
0003.3: v[23]: grep nonexistent -- subdir   1.80(1.43+0.35)   1.86(1.50+0.34) +3.3%
0003.4: v[23]: ls-files -- subdir           1.67(1.29+0.36)   1.70(1.31+0.37) +1.8%
0003.6: v4: update-index                    2.97(2.44+0.51)   3.00(2.50+0.48) +1.0%
0003.7: v4: grep nonexistent -- subdir      1.45(1.12+0.31)   1.48(1.06+0.40) +2.1%
0003.8: v4: ls-files -- subdir              1.33(0.99+0.33)   1.35(1.02+0.32) +1.5%
0003.10: v5: update-index                   2.42(1.87+0.54)   2.50(1.84+0.63) +3.3%
0003.11: v5: ls-files                       1.75(1.38+0.35)   1.80(1.37+0.41) +2.9%
0003.12: v5: grep nonexistent -- subdir     0.07(0.05+0.01)   0.07(0.05+0.01) +0.0%
0003.13: v5: ls-files -- subdir             0.07(0.05+0.01)   0.07(0.06+0.00) +0.0%
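
For reference, the flag-flipping scheme discussed above could be
sketched roughly like this.  This is purely illustrative: the constant
names (CE_STAGESHIFT, CE_INVALID), struct v5_entry and the function
names are invented here and are not the actual index-v5 code.

```c
#include <string.h>

/*
 * Sketch: keep all stages in the fileentries table, with a 2-bit
 * stage field and an "invalid" bit packed into the entry flags.
 */
#define CE_STAGESHIFT 12
#define CE_STAGEMASK  (0x3u << CE_STAGESHIFT)
#define CE_INVALID    (0x1u << 14)

struct v5_entry {
	unsigned int flags;
	const char *name;
};

static unsigned int entry_stage(const struct v5_entry *e)
{
	return (e->flags & CE_STAGEMASK) >> CE_STAGESHIFT;
}

/*
 * Resolving <name>: flip its stage-1 entry to stage 0 and mark the
 * higher-stage entries invalid so readers skip them; the stale
 * entries are only dropped the next time the index is rewritten.
 */
static void resolve_entry(struct v5_entry *entries, int nr,
			  const char *name)
{
	int i;
	for (i = 0; i < nr; i++) {
		if (strcmp(entries[i].name, name))
			continue;
		if (entry_stage(&entries[i]) == 1)
			entries[i].flags &= ~CE_STAGEMASK; /* stage 1 -> 0 */
		else if (entry_stage(&entries[i]) > 1)
			entries[i].flags |= CE_INVALID;    /* skip on read */
	}
}
```

The point is that a resolution only touches the flag words (and their
checksums) of the affected entries, instead of rewriting or
truncating the whole index file.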

end of thread, other threads:[~2013-08-09 13:10 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
2013-07-12 17:26 [PATCH v2 00/19] Index-v5 Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 01/19] t2104: Don't fail for index versions other than [23] Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 02/19] read-cache: split index file version specific functionality Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 03/19] read-cache: move index v2 specific functions to their own file Thomas Gummerer
2013-07-14  3:10   ` Duy Nguyen
2013-07-19 14:53     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 04/19] read-cache: Re-read index if index file changed Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 05/19] Add documentation for the index api Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 06/19] read-cache: add index reading api Thomas Gummerer
2013-07-14  3:21   ` Duy Nguyen
2013-07-12 17:26 ` [PATCH v2 07/19] make sure partially read index is not changed Thomas Gummerer
2013-07-14  3:29   ` Duy Nguyen
2013-07-17 12:56     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 08/19] grep.c: Use index api Thomas Gummerer
2013-07-14  3:32   ` Duy Nguyen
2013-07-15  9:51     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 09/19] ls-files.c: use " Thomas Gummerer
2013-07-14  3:39   ` Duy Nguyen
2013-07-17  8:07     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 10/19] documentation: add documentation of the index-v5 file format Thomas Gummerer
2013-07-14  3:59   ` Duy Nguyen
2013-07-17  8:09     ` Thomas Gummerer
2013-08-04 11:26   ` Duy Nguyen
2013-08-04 17:58     ` Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 11/19] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 12/19] read-cache: read index-v5 Thomas Gummerer
2013-07-14  4:42   ` Duy Nguyen
2013-08-07  8:13     ` Thomas Gummerer
2013-07-15 10:12   ` Duy Nguyen
2013-07-17  8:11     ` Thomas Gummerer
2013-08-08  2:00       ` Duy Nguyen
2013-08-08 13:28         ` Thomas Gummerer
2013-08-09 13:10         ` Thomas Gummerer
2013-08-07  8:23     ` Thomas Gummerer
2013-08-08  2:09       ` Duy Nguyen
2013-07-12 17:26 ` [PATCH v2 13/19] read-cache: read resolve-undo data Thomas Gummerer
2013-07-12 17:26 ` [PATCH v2 14/19] read-cache: read cache-tree in index-v5 Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 15/19] read-cache: write index-v5 Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 16/19] read-cache: write index-v5 cache-tree data Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 17/19] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 18/19] update-index.c: rewrite index when index-version is given Thomas Gummerer
2013-07-12 17:27 ` [PATCH v2 19/19] p0003-index.sh: add perf test for the index formats Thomas Gummerer
2013-07-14  2:59 ` [PATCH v2 00/19] Index-v5 Duy Nguyen
2013-07-15  9:30   ` Thomas Gummerer
2013-07-15  9:38     ` Duy Nguyen
2013-07-17  8:12       ` Thomas Gummerer
2013-07-17 23:58         ` Junio C Hamano
2013-07-19 17:37           ` Thomas Gummerer
2013-07-19 18:25             ` Junio C Hamano
2013-07-16 21:03 ` Ramsay Jones
2013-07-17  8:04   ` Thomas Gummerer
