git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/22] Index v5
@ 2013-07-07  8:11 Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 01/22] t2104: Don't fail for index versions other than [23] Thomas Gummerer
                   ` (21 more replies)
  0 siblings, 22 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Hi,

This is a follow-up to last year's Google Summer of Code project, which
wasn't merged back then.  The previous rounds of the series are at
$gmane/202752, $gmane/202923, $gmane/203088 and $gmane/203517.

Since then I have added an index reading api, which allows certain
parts of Git to take advantage of the partial reading capability of
the new index file format.  In this series grep and ls-files, and the
code paths used by them, are switched to the new api.

Another goal of the api is to hide the open-coded loops and accesses
to the in-memory format, to make it simpler to change the in-memory
format to one that fits the new on-disk format better.

Apart from the new patches, the "read-cache: read index-v5" patch
changed the most, as the possibility to read the index partially was
added.

The first patch for t2104 makes sense without the rest of the series,
as it fixes running the test-suite with index-v4 as the default index
format.

Below are the timings for the WebKit repository.  c4b2d88 is the
revision before adding anything, while HEAD gives the times at the
last patch in the series.  The slower times in update-index come from
the update-index patch, so they are not a problem (in c4b2d88 the
index is only read, while in HEAD it's read and written).  The
increase in time in the ls-files test comes from not having the
prune_cache function in the index api.

I have not added this function as it only seems of use in ls-files,
but it can still be added if this increase is a problem.

Test                                        c4b2d88           HEAD                   
-------------------------------------------------------------------------------------
0003.2: v[23]: update-index                 0.11(0.06+0.04)   0.22(0.15+0.05) +100.0%
0003.3: v[23]: grep nonexistent -- subdir   0.12(0.08+0.03)   0.12(0.09+0.02) +0.0%  
0003.4: v[23]: ls-files -- subdir           0.11(0.08+0.01)   0.12(0.08+0.03) +9.1%  
0003.6: v4: update-index                    0.09(0.06+0.02)   0.18(0.14+0.03) +100.0%
0003.7: v4: grep nonexistent -- subdir      0.10(0.08+0.02)   0.10(0.07+0.02) +0.0%  
0003.8: v4: ls-files -- subdir              0.09(0.07+0.01)   0.10(0.08+0.01) +11.1% 
0003.10: v5: update-index                   <missing>         0.15(0.10+0.03)        
0003.11: v5: grep nonexistent -- subdir     <missing>         0.01(0.00+0.00)        
0003.12: v5: ls-files -- subdir             <missing>         0.01(0.01+0.00)        

And for reference, the times for a synthetic repository with a 470MB
index file, just to demonstrate the improvements in large repositories.

Test                                        c4b2d88           HEAD                   
-------------------------------------------------------------------------------------
0003.2: v[23]: update-index                 1.50(1.18+0.30)   3.18(2.55+0.60) +112.0%
0003.3: v[23]: grep nonexistent -- subdir   1.62(1.28+0.32)   1.66(1.28+0.36) +2.5%  
0003.4: v[23]: ls-files -- subdir           1.49(1.21+0.26)   1.62(1.28+0.32) +8.7%  
0003.6: v4: update-index                    1.18(0.89+0.28)   2.68(2.22+0.44) +127.1%
0003.7: v4: grep nonexistent -- subdir      1.29(1.00+0.28)   1.30(1.04+0.24) +0.8%  
0003.8: v4: ls-files -- subdir              1.20(0.95+0.23)   1.30(0.98+0.30) +8.3%  
0003.10: v5: update-index                   <missing>         2.12(1.63+0.48)        
0003.11: v5: grep nonexistent -- subdir     <missing>         0.08(0.04+0.02)        
0003.12: v5: ls-files -- subdir             <missing>         0.07(0.05+0.01)        


Thomas Gummerer (21):
  t2104: Don't fail for index versions other than [23]
  read-cache: split index file version specific functionality
  read-cache: move index v2 specific functions to their own file
  read-cache: Re-read index if index file changed
  read-cache: add index reading api
  make sure partially read index is not changed
  dir.c: use index api
  tree.c: use index api
  name-hash.c: use index api
  grep.c: Use index api
  ls-files.c: use the index api
  read-cache: make read_blob_data_from_index use index api
  documentation: add documentation of the index-v5 file format
  read-cache: make in-memory format aware of stat_crc
  read-cache: read index-v5
  read-cache: read resolve-undo data
  read-cache: read cache-tree in index-v5
  read-cache: write index-v5
  read-cache: write index-v5 cache-tree data
  read-cache: write resolve-undo data for index-v5
  update-index.c: rewrite index when index-version is given

Thomas Rast (1):
  p0003-index.sh: add perf test for the index formats

 Documentation/technical/index-file-format-v5.txt |  296 +++++
 Makefile                                         |    3 +
 builtin/grep.c                                   |   71 +-
 builtin/ls-files.c                               |  213 ++-
 builtin/update-index.c                           |    8 +-
 cache-tree.c                                     |    2 +-
 cache-tree.h                                     |    6 +
 cache.h                                          |  158 ++-
 dir.c                                            |   33 +-
 name-hash.c                                      |   11 +-
 read-cache-v2.c                                  |  651 +++++++++
 read-cache-v5.c                                  | 1536 ++++++++++++++++++++++
 read-cache.c                                     |  752 ++++-------
 read-cache.h                                     |   69 +
 t/perf/p0003-index.sh                            |   59 +
 t/t2104-update-index-skip-worktree.sh            |    1 +
 test-index-version.c                             |    7 +-
 tree.c                                           |   38 +-
 18 files changed, 3183 insertions(+), 731 deletions(-)
 create mode 100644 Documentation/technical/index-file-format-v5.txt
 create mode 100644 read-cache-v2.c
 create mode 100644 read-cache-v5.c
 create mode 100644 read-cache.h
 create mode 100755 t/perf/p0003-index.sh

-- 
1.8.3.453.g1dfc63d


* [PATCH 01/22] t2104: Don't fail for index versions other than [23]
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 02/22] read-cache: split index file version specific functionality Thomas Gummerer
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

t2104 currently checks for the exact index version 2 or 3,
depending on whether the skip-worktree flag is set or not.
Other index versions do not use extended flags and thus cannot
be tested for version changes.

Make this test update the index to version 2 at the beginning
of the test. Testing the skip-worktree flags for the default
index format is still covered by t7011 and t7012.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 t/t2104-update-index-skip-worktree.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/t/t2104-update-index-skip-worktree.sh b/t/t2104-update-index-skip-worktree.sh
index 1d0879b..bd9644f 100755
--- a/t/t2104-update-index-skip-worktree.sh
+++ b/t/t2104-update-index-skip-worktree.sh
@@ -22,6 +22,7 @@ H sub/2
 EOF
 
 test_expect_success 'setup' '
+	git update-index --index-version=2 &&
 	mkdir sub &&
 	touch ./1 ./2 sub/1 sub/2 &&
 	git add 1 2 sub/1 sub/2 &&
-- 
1.8.3.453.g1dfc63d


* [PATCH 02/22] read-cache: split index file version specific functionality
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 01/22] t2104: Don't fail for index versions other than [23] Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 03/22] read-cache: move index v2 specific functions to their own file Thomas Gummerer
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Split the index file version specific functionality into its own
functions, to prepare for moving the version specific parts to their
own file.  This makes it easier to add a new index file format later.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h              |   5 +-
 read-cache.c         | 130 +++++++++++++++++++++++++++++++++------------------
 test-index-version.c |   2 +-
 3 files changed, 90 insertions(+), 47 deletions(-)

diff --git a/cache.h b/cache.h
index c288678..7af853b 100644
--- a/cache.h
+++ b/cache.h
@@ -100,9 +100,12 @@ unsigned long git_deflate_bound(git_zstream *, unsigned long);
  */
 
 #define CACHE_SIGNATURE 0x44495243	/* "DIRC" */
-struct cache_header {
+struct cache_version_header {
 	unsigned int hdr_signature;
 	unsigned int hdr_version;
+};
+
+struct cache_header {
 	unsigned int hdr_entries;
 };
 
diff --git a/read-cache.c b/read-cache.c
index d5201f9..93947bf 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1268,10 +1268,8 @@ struct ondisk_cache_entry_extended {
 			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
 			    ondisk_cache_entry_size(ce_namelen(ce)))
 
-static int verify_hdr(struct cache_header *hdr, unsigned long size)
+static int verify_hdr_version(struct cache_version_header *hdr, unsigned long size)
 {
-	git_SHA_CTX c;
-	unsigned char sha1[20];
 	int hdr_version;
 
 	if (hdr->hdr_signature != htonl(CACHE_SIGNATURE))
@@ -1279,10 +1277,22 @@ static int verify_hdr(struct cache_header *hdr, unsigned long size)
 	hdr_version = ntohl(hdr->hdr_version);
 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
 		return error("bad index version %d", hdr_version);
+	return 0;
+}
+
+static int verify_hdr(void *mmap, unsigned long size)
+{
+	git_SHA_CTX c;
+	unsigned char sha1[20];
+
+	if (size < sizeof(struct cache_version_header)
+	    + sizeof(struct cache_header) + 20)
+		die("index file smaller than expected");
+
 	git_SHA1_Init(&c);
-	git_SHA1_Update(&c, hdr, size - 20);
+	git_SHA1_Update(&c, mmap, size - 20);
 	git_SHA1_Final(sha1, &c);
-	if (hashcmp(sha1, (unsigned char *)hdr + size - 20))
+	if (hashcmp(sha1, (unsigned char *)mmap + size - 20))
 		return error("bad index file sha1 signature");
 	return 0;
 }
@@ -1424,47 +1434,19 @@ static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
 	return ce;
 }
 
-/* remember to discard_cache() before reading a different cache! */
-int read_index_from(struct index_state *istate, const char *path)
+static int read_index_v2(struct index_state *istate, void *mmap, unsigned long mmap_size)
 {
-	int fd, i;
-	struct stat st;
+	int i;
 	unsigned long src_offset;
-	struct cache_header *hdr;
-	void *mmap;
-	size_t mmap_size;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v2;
 	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
-	if (istate->initialized)
-		return istate->cache_nr;
-
-	istate->timestamp.sec = 0;
-	istate->timestamp.nsec = 0;
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		if (errno == ENOENT)
-			return 0;
-		die_errno("index file open failed");
-	}
-
-	if (fstat(fd, &st))
-		die_errno("cannot stat the open index");
-
-	mmap_size = xsize_t(st.st_size);
-	if (mmap_size < sizeof(struct cache_header) + 20)
-		die("index file smaller than expected");
-
-	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
-	if (mmap == MAP_FAILED)
-		die_errno("unable to map index file");
-	close(fd);
-
 	hdr = mmap;
-	if (verify_hdr(hdr, mmap_size) < 0)
-		goto unmap;
+	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
 
 	istate->version = ntohl(hdr->hdr_version);
-	istate->cache_nr = ntohl(hdr->hdr_entries);
+	istate->cache_nr = ntohl(hdr_v2->hdr_entries);
 	istate->cache_alloc = alloc_nr(istate->cache_nr);
 	istate->cache = xcalloc(istate->cache_alloc, sizeof(*istate->cache));
 	istate->initialized = 1;
@@ -1474,7 +1456,7 @@ int read_index_from(struct index_state *istate, const char *path)
 	else
 		previous_name = NULL;
 
-	src_offset = sizeof(*hdr);
+	src_offset = sizeof(*hdr) + sizeof(*hdr_v2);
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct ondisk_cache_entry *disk_ce;
 		struct cache_entry *ce;
@@ -1487,8 +1469,6 @@ int read_index_from(struct index_state *istate, const char *path)
 		src_offset += consumed;
 	}
 	strbuf_release(&previous_name_buf);
-	istate->timestamp.sec = st.st_mtime;
-	istate->timestamp.nsec = ST_MTIME_NSEC(st);
 
 	while (src_offset <= mmap_size - 20 - 8) {
 		/* After an array of active_nr index entries,
@@ -1508,6 +1488,58 @@ int read_index_from(struct index_state *istate, const char *path)
 		src_offset += 8;
 		src_offset += extsize;
 	}
+	return 0;
+unmap:
+	munmap(mmap, mmap_size);
+	die("index file corrupt");
+}
+
+/* remember to discard_cache() before reading a different cache! */
+int read_index_from(struct index_state *istate, const char *path)
+{
+	int fd;
+	struct stat st;
+	struct cache_version_header *hdr;
+	void *mmap;
+	size_t mmap_size;
+
+	errno = EBUSY;
+	if (istate->initialized)
+		return istate->cache_nr;
+
+	errno = ENOENT;
+	istate->timestamp.sec = 0;
+	istate->timestamp.nsec = 0;
+	fd = open(path, O_RDONLY);
+	if (fd < 0) {
+		if (errno == ENOENT)
+			return 0;
+		die_errno("index file open failed");
+	}
+
+	if (fstat(fd, &st))
+		die_errno("cannot stat the open index");
+
+	errno = EINVAL;
+	mmap_size = xsize_t(st.st_size);
+	if (mmap_size < sizeof(struct cache_header) + 20)
+		die("index file smaller than expected");
+
+	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+	close(fd);
+	if (mmap == MAP_FAILED)
+		die_errno("unable to map index file");
+
+	hdr = mmap;
+	if (verify_hdr_version(hdr, mmap_size) < 0)
+		goto unmap;
+
+	if (verify_hdr(mmap, mmap_size) < 0)
+		goto unmap;
+
+	read_index_v2(istate, mmap, mmap_size);
+	istate->timestamp.sec = st.st_mtime;
+	istate->timestamp.nsec = ST_MTIME_NSEC(st);
 	munmap(mmap, mmap_size);
 	return istate->cache_nr;
 
@@ -1771,10 +1803,11 @@ void update_index_if_able(struct index_state *istate, struct lock_file *lockfile
 		rollback_lock_file(lockfile);
 }
 
-int write_index(struct index_state *istate, int newfd)
+static int write_index_v2(struct index_state *istate, int newfd)
 {
 	git_SHA_CTX c;
-	struct cache_header hdr;
+	struct cache_version_header hdr;
+	struct cache_header hdr_v2;
 	int i, err, removed, extended, hdr_version;
 	struct cache_entry **cache = istate->cache;
 	int entries = istate->cache_nr;
@@ -1804,11 +1837,13 @@ int write_index(struct index_state *istate, int newfd)
 
 	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
 	hdr.hdr_version = htonl(hdr_version);
-	hdr.hdr_entries = htonl(entries - removed);
+	hdr_v2.hdr_entries = htonl(entries - removed);
 
 	git_SHA1_Init(&c);
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
+	if (ce_write(&c, newfd, &hdr_v2, sizeof(hdr_v2)) < 0)
+		return -1;
 
 	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
 	for (i = 0; i < entries; i++) {
@@ -1854,6 +1889,11 @@ int write_index(struct index_state *istate, int newfd)
 	return 0;
 }
 
+int write_index(struct index_state *istate, int newfd)
+{
+	return write_index_v2(istate, newfd);
+}
+
 /*
  * Read the index file that is potentially unmerged into given
  * index_state, dropping any unmerged entries.  Returns true if
diff --git a/test-index-version.c b/test-index-version.c
index 05d4699..4c0386f 100644
--- a/test-index-version.c
+++ b/test-index-version.c
@@ -2,7 +2,7 @@
 
 int main(int argc, char **argv)
 {
-	struct cache_header hdr;
+	struct cache_version_header hdr;
 	int version;
 
 	memset(&hdr,0,sizeof(hdr));
-- 
1.8.3.453.g1dfc63d


* [PATCH 03/22] read-cache: move index v2 specific functions to their own file
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 01/22] t2104: Don't fail for index versions other than [23] Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 02/22] read-cache: split index file version specific functionality Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 04/22] read-cache: Re-read index if index file changed Thomas Gummerer
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Move the index version 2 specific functions to their own file.  The
version-independent functions stay in read-cache.c, while the index
version 2 specific functions go to read-cache-v2.c.

Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Makefile             |   2 +
 cache.h              |  16 +-
 read-cache-v2.c      | 556 +++++++++++++++++++++++++++++++++++++++++++++++++
 read-cache.c         | 575 ++++-----------------------------------------------
 read-cache.h         |  57 +++++
 test-index-version.c |   5 +
 6 files changed, 661 insertions(+), 550 deletions(-)
 create mode 100644 read-cache-v2.c
 create mode 100644 read-cache.h

diff --git a/Makefile b/Makefile
index 5a68fe5..73369ae 100644
--- a/Makefile
+++ b/Makefile
@@ -711,6 +711,7 @@ LIB_H += progress.h
 LIB_H += prompt.h
 LIB_H += quote.h
 LIB_H += reachable.h
+LIB_H += read-cache.h
 LIB_H += reflog-walk.h
 LIB_H += refs.h
 LIB_H += remote.h
@@ -854,6 +855,7 @@ LIB_OBJS += prompt.o
 LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
 LIB_OBJS += read-cache.o
+LIB_OBJS += read-cache-v2.o
 LIB_OBJS += reflog-walk.o
 LIB_OBJS += refs.o
 LIB_OBJS += remote.o
diff --git a/cache.h b/cache.h
index 7af853b..5082b34 100644
--- a/cache.h
+++ b/cache.h
@@ -95,19 +95,8 @@ unsigned long git_deflate_bound(git_zstream *, unsigned long);
  */
 #define DEFAULT_GIT_PORT 9418
 
-/*
- * Basic data structures for the directory cache
- */
 
 #define CACHE_SIGNATURE 0x44495243	/* "DIRC" */
-struct cache_version_header {
-	unsigned int hdr_signature;
-	unsigned int hdr_version;
-};
-
-struct cache_header {
-	unsigned int hdr_entries;
-};
 
 #define INDEX_FORMAT_LB 2
 #define INDEX_FORMAT_UB 4
@@ -280,6 +269,7 @@ struct index_state {
 		 initialized : 1;
 	struct hash_table name_hash;
 	struct hash_table dir_hash;
+	struct index_ops *ops;
 };
 
 extern struct index_state the_index;
@@ -489,8 +479,8 @@ extern void *read_blob_data_from_index(struct index_state *, const char *, unsig
 #define CE_MATCH_RACY_IS_DIRTY		02
 /* do stat comparison even if CE_SKIP_WORKTREE is true */
 #define CE_MATCH_IGNORE_SKIP_WORKTREE	04
-extern int ie_match_stat(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
-extern int ie_modified(const struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
+extern int ie_match_stat(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
+extern int ie_modified(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 
 #define PATHSPEC_ONESTAR 1	/* the pathspec pattern sastisfies GFNM_ONESTAR */
 
diff --git a/read-cache-v2.c b/read-cache-v2.c
new file mode 100644
index 0000000..a6883c3
--- /dev/null
+++ b/read-cache-v2.c
@@ -0,0 +1,556 @@
+#include "cache.h"
+#include "read-cache.h"
+#include "resolve-undo.h"
+#include "cache-tree.h"
+#include "varint.h"
+
+/* Mask for the name length in ce_flags in the on-disk index */
+#define CE_NAMEMASK  (0x0fff)
+
+struct cache_header {
+	unsigned int hdr_entries;
+};
+
+/*****************************************************************
+ * Index File I/O
+ *****************************************************************/
+
+/*
+ * dev/ino/uid/gid/size are also just tracked to the low 32 bits
+ * Again - this is just a (very strong in practice) heuristic that
+ * the inode hasn't changed.
+ *
+ * We save the fields in big-endian order to allow using the
+ * index file over NFS transparently.
+ */
+struct ondisk_cache_entry {
+	struct cache_time ctime;
+	struct cache_time mtime;
+	unsigned int dev;
+	unsigned int ino;
+	unsigned int mode;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int size;
+	unsigned char sha1[20];
+	unsigned short flags;
+	char name[FLEX_ARRAY]; /* more */
+};
+
+/*
+ * This struct is used when CE_EXTENDED bit is 1
+ * The struct must match ondisk_cache_entry exactly from
+ * ctime till flags
+ */
+struct ondisk_cache_entry_extended {
+	struct cache_time ctime;
+	struct cache_time mtime;
+	unsigned int dev;
+	unsigned int ino;
+	unsigned int mode;
+	unsigned int uid;
+	unsigned int gid;
+	unsigned int size;
+	unsigned char sha1[20];
+	unsigned short flags;
+	unsigned short flags2;
+	char name[FLEX_ARRAY]; /* more */
+};
+
+/* These are only used for v3 or lower */
+#define align_flex_name(STRUCT,len) ((offsetof(struct STRUCT,name) + (len) + 8) & ~7)
+#define ondisk_cache_entry_size(len) align_flex_name(ondisk_cache_entry,len)
+#define ondisk_cache_entry_extended_size(len) align_flex_name(ondisk_cache_entry_extended,len)
+#define ondisk_ce_size(ce) (((ce)->ce_flags & CE_EXTENDED) ? \
+			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
+			    ondisk_cache_entry_size(ce_namelen(ce)))
+
+static int verify_hdr(void *mmap, unsigned long size)
+{
+	git_SHA_CTX c;
+	unsigned char sha1[20];
+
+	if (size < sizeof(struct cache_version_header)
+			+ sizeof(struct cache_header) + 20)
+		die("index file smaller than expected");
+
+	git_SHA1_Init(&c);
+	git_SHA1_Update(&c, mmap, size - 20);
+	git_SHA1_Final(sha1, &c);
+	if (hashcmp(sha1, (unsigned char *)mmap + size - 20))
+		return error("bad index file sha1 signature");
+	return 0;
+}
+
+static int match_stat_basic(const struct cache_entry *ce,
+			    struct stat *st, int changed)
+{
+	changed |= match_stat_data(&ce->ce_stat_data, st);
+
+	/* Racily smudged entry? */
+	if (!ce->ce_stat_data.sd_size) {
+		if (!is_empty_blob_sha1(ce->sha1))
+			changed |= DATA_CHANGED;
+	}
+	return changed;
+}
+
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+						   unsigned int flags,
+						   const char *name,
+						   size_t len)
+{
+	struct cache_entry *ce = xmalloc(cache_entry_size(len));
+
+	ce->ce_stat_data.sd_ctime.sec = ntoh_l(ondisk->ctime.sec);
+	ce->ce_stat_data.sd_mtime.sec = ntoh_l(ondisk->mtime.sec);
+	ce->ce_stat_data.sd_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
+	ce->ce_stat_data.sd_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
+	ce->ce_stat_data.sd_dev   = ntoh_l(ondisk->dev);
+	ce->ce_stat_data.sd_ino   = ntoh_l(ondisk->ino);
+	ce->ce_mode  = ntoh_l(ondisk->mode);
+	ce->ce_stat_data.sd_uid   = ntoh_l(ondisk->uid);
+	ce->ce_stat_data.sd_gid   = ntoh_l(ondisk->gid);
+	ce->ce_stat_data.sd_size  = ntoh_l(ondisk->size);
+	ce->ce_flags = flags & ~CE_NAMEMASK;
+	ce->ce_namelen = len;
+	hashcpy(ce->sha1, ondisk->sha1);
+	memcpy(ce->name, name, len);
+	ce->name[len] = '\0';
+	return ce;
+}
+
+/*
+ * Adjacent cache entries tend to share the leading paths, so it makes
+ * sense to only store the differences in later entries.  In the v4
+ * on-disk format of the index, each on-disk cache entry stores the
+ * number of bytes to be stripped from the end of the previous name,
+ * and the bytes to append to the result, to come up with its name.
+ */
+static unsigned long expand_name_field(struct strbuf *name, const char *cp_)
+{
+	const unsigned char *ep, *cp = (const unsigned char *)cp_;
+	size_t len = decode_varint(&cp);
+
+	if (name->len < len)
+		die("malformed name field in the index");
+	strbuf_remove(name, name->len - len, len);
+	for (ep = cp; *ep; ep++)
+		; /* find the end */
+	strbuf_add(name, cp, ep - cp);
+	return (const char *)ep + 1 - cp_;
+}
+
+static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
+					    unsigned long *ent_size,
+					    struct strbuf *previous_name)
+{
+	struct cache_entry *ce;
+	size_t len;
+	const char *name;
+	unsigned int flags;
+
+	/* On-disk flags are just 16 bits */
+	flags = ntoh_s(ondisk->flags);
+	len = flags & CE_NAMEMASK;
+
+	if (flags & CE_EXTENDED) {
+		struct ondisk_cache_entry_extended *ondisk2;
+		int extended_flags;
+		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
+		extended_flags = ntoh_s(ondisk2->flags2) << 16;
+		/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
+		if (extended_flags & ~CE_EXTENDED_FLAGS)
+			die("Unknown index entry format %08x", extended_flags);
+		flags |= extended_flags;
+		name = ondisk2->name;
+	}
+	else
+		name = ondisk->name;
+
+	if (!previous_name) {
+		/* v3 and earlier */
+		if (len == CE_NAMEMASK)
+			len = strlen(name);
+		ce = cache_entry_from_ondisk(ondisk, flags, name, len);
+
+		*ent_size = ondisk_ce_size(ce);
+	} else {
+		unsigned long consumed;
+		consumed = expand_name_field(previous_name, name);
+		ce = cache_entry_from_ondisk(ondisk, flags,
+					     previous_name->buf,
+					     previous_name->len);
+
+		*ent_size = (name - ((char *)ondisk)) + consumed;
+	}
+	return ce;
+}
+
+static int read_index_extension(struct index_state *istate,
+				const char *ext, void *data, unsigned long sz)
+{
+	switch (CACHE_EXT(ext)) {
+	case CACHE_EXT_TREE:
+		istate->cache_tree = cache_tree_read(data, sz);
+		break;
+	case CACHE_EXT_RESOLVE_UNDO:
+		istate->resolve_undo = resolve_undo_read(data, sz);
+		break;
+	default:
+		if (*ext < 'A' || 'Z' < *ext)
+			return error("index uses %.4s extension, which we do not understand",
+				     ext);
+		fprintf(stderr, "ignoring %.4s extension\n", ext);
+		break;
+	}
+	return 0;
+}
+
+static int read_index_v2(struct index_state *istate, void *mmap,
+			 unsigned long mmap_size)
+{
+	int i;
+	unsigned long src_offset;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v2;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+
+	hdr = mmap;
+	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
+	istate->version = ntohl(hdr->hdr_version);
+	istate->cache_nr = ntohl(hdr_v2->hdr_entries);
+	istate->cache_alloc = alloc_nr(istate->cache_nr);
+	istate->cache = xcalloc(istate->cache_alloc, sizeof(struct cache_entry *));
+	istate->initialized = 1;
+
+	if (istate->version == 4)
+		previous_name = &previous_name_buf;
+	else
+		previous_name = NULL;
+
+	src_offset = sizeof(*hdr) + sizeof(*hdr_v2);
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct ondisk_cache_entry *disk_ce;
+		struct cache_entry *ce;
+		unsigned long consumed;
+
+		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
+		ce = create_from_disk(disk_ce, &consumed, previous_name);
+		set_index_entry(istate, i, ce);
+
+		src_offset += consumed;
+	}
+	strbuf_release(&previous_name_buf);
+
+	while (src_offset <= mmap_size - 20 - 8) {
+		/* After an array of active_nr index entries,
+		 * there can be arbitrary number of extended
+		 * sections, each of which is prefixed with
+		 * extension name (4-byte) and section length
+		 * in 4-byte network byte order.
+		 */
+		uint32_t extsize;
+		memcpy(&extsize, (char *)mmap + src_offset + 4, 4);
+		extsize = ntohl(extsize);
+		if (read_index_extension(istate,
+					(const char *) mmap + src_offset,
+					(char *) mmap + src_offset + 8,
+					extsize) < 0)
+			goto unmap;
+		src_offset += 8;
+		src_offset += extsize;
+	}
+	return 0;
+unmap:
+	munmap(mmap, mmap_size);
+	die("index file corrupt");
+}
+
+#define WRITE_BUFFER_SIZE 8192
+static unsigned char write_buffer[WRITE_BUFFER_SIZE];
+static unsigned long write_buffer_len;
+
+static int ce_write_flush(git_SHA_CTX *context, int fd)
+{
+	unsigned int buffered = write_buffer_len;
+	if (buffered) {
+		git_SHA1_Update(context, write_buffer, buffered);
+		if (write_in_full(fd, write_buffer, buffered) != buffered)
+			return -1;
+		write_buffer_len = 0;
+	}
+	return 0;
+}
+
+static int ce_write(git_SHA_CTX *context, int fd, void *data, unsigned int len)
+{
+	while (len) {
+		unsigned int buffered = write_buffer_len;
+		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
+		if (partial > len)
+			partial = len;
+		memcpy(write_buffer + buffered, data, partial);
+		buffered += partial;
+		if (buffered == WRITE_BUFFER_SIZE) {
+			write_buffer_len = buffered;
+			if (ce_write_flush(context, fd))
+				return -1;
+			buffered = 0;
+		}
+		write_buffer_len = buffered;
+		len -= partial;
+		data = (char *) data + partial;
+	}
+	return 0;
+}
+
+static int write_index_ext_header(git_SHA_CTX *context, int fd,
+				  unsigned int ext, unsigned int sz)
+{
+	ext = htonl(ext);
+	sz = htonl(sz);
+	return ((ce_write(context, fd, &ext, 4) < 0) ||
+		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
+}
+
+static int ce_flush(git_SHA_CTX *context, int fd)
+{
+	unsigned int left = write_buffer_len;
+
+	if (left) {
+		write_buffer_len = 0;
+		git_SHA1_Update(context, write_buffer, left);
+	}
+
+	/* Flush first if not enough space for SHA1 signature */
+	if (left + 20 > WRITE_BUFFER_SIZE) {
+		if (write_in_full(fd, write_buffer, left) != left)
+			return -1;
+		left = 0;
+	}
+
+	/* Append the SHA1 signature at the end */
+	git_SHA1_Final(write_buffer + left, context);
+	left += 20;
+	return (write_in_full(fd, write_buffer, left) != left) ? -1 : 0;
+}
+
+static void ce_smudge_racily_clean_entry(struct index_state *istate, struct cache_entry *ce)
+{
+	/*
+	 * The only thing we care about in this function is to smudge the
+	 * falsely clean entry due to touch-update-touch race, so we leave
+	 * everything else as they are.  We are called for entries whose
+	 * ce_stat_data.sd_mtime match the index file mtime.
+	 *
+	 * Note that this actually does not do much for gitlinks, for
+	 * which ce_match_stat_basic() always goes to the actual
+	 * contents.  The caller checks with is_racy_timestamp() which
+	 * always says "no" for gitlinks, so we are not called for them ;-)
+	 */
+	struct stat st;
+
+	if (lstat(ce->name, &st) < 0)
+		return;
+	if (ce_match_stat_basic(istate, ce, &st))
+		return;
+	if (ce_modified_check_fs(ce, &st)) {
+		/* This is "racily clean"; smudge it.  Note that this
+		 * is a tricky code.  At first glance, it may appear
+		 * that it can break with this sequence:
+		 *
+		 * $ echo xyzzy >frotz
+		 * $ git-update-index --add frotz
+		 * $ : >frotz
+		 * $ sleep 3
+		 * $ echo filfre >nitfol
+		 * $ git-update-index --add nitfol
+		 *
+		 * but it does not.  When the second update-index runs,
+		 * it notices that the entry "frotz" has the same timestamp
+		 * as index, and if we were to smudge it by resetting its
+		 * size to zero here, then the object name recorded
+		 * in index is the 6-byte file but the cached stat information
+		 * becomes zero --- which would then match what we would
+		 * obtain from the filesystem next time we stat("frotz").
+		 *
+		 * However, the second update-index, before calling
+		 * this function, notices that the cached size is 6
+		 * bytes and what is on the filesystem is an empty
+		 * file, and never calls us, so the cached size information
+		 * for "frotz" stays 6 which does not match the filesystem.
+		 */
+		ce->ce_stat_data.sd_size = 0;
+	}
+}
+
+/* Copy miscellaneous fields but not the name */
+static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
+				       struct cache_entry *ce)
+{
+	short flags;
+
+	ondisk->ctime.sec = htonl(ce->ce_stat_data.sd_ctime.sec);
+	ondisk->mtime.sec = htonl(ce->ce_stat_data.sd_mtime.sec);
+	ondisk->ctime.nsec = htonl(ce->ce_stat_data.sd_ctime.nsec);
+	ondisk->mtime.nsec = htonl(ce->ce_stat_data.sd_mtime.nsec);
+	ondisk->dev  = htonl(ce->ce_stat_data.sd_dev);
+	ondisk->ino  = htonl(ce->ce_stat_data.sd_ino);
+	ondisk->mode = htonl(ce->ce_mode);
+	ondisk->uid  = htonl(ce->ce_stat_data.sd_uid);
+	ondisk->gid  = htonl(ce->ce_stat_data.sd_gid);
+	ondisk->size = htonl(ce->ce_stat_data.sd_size);
+	hashcpy(ondisk->sha1, ce->sha1);
+
+	flags = ce->ce_flags;
+	flags |= (ce_namelen(ce) >= CE_NAMEMASK ? CE_NAMEMASK : ce_namelen(ce));
+	ondisk->flags = htons(flags);
+	if (ce->ce_flags & CE_EXTENDED) {
+		struct ondisk_cache_entry_extended *ondisk2;
+		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
+		ondisk2->flags2 = htons((ce->ce_flags & CE_EXTENDED_FLAGS) >> 16);
+		return ondisk2->name;
+	}
+	else {
+		return ondisk->name;
+	}
+}
+
+static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
+			  struct strbuf *previous_name)
+{
+	int size;
+	struct ondisk_cache_entry *ondisk;
+	char *name;
+	int result;
+
+	if (!previous_name) {
+		size = ondisk_ce_size(ce);
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, ce->name, ce_namelen(ce));
+	} else {
+		int common, to_remove, prefix_size;
+		unsigned char to_remove_vi[16];
+		for (common = 0;
+		     (ce->name[common] &&
+		      common < previous_name->len &&
+		      ce->name[common] == previous_name->buf[common]);
+		     common++)
+			; /* still matching */
+		to_remove = previous_name->len - common;
+		prefix_size = encode_varint(to_remove, to_remove_vi);
+
+		if (ce->ce_flags & CE_EXTENDED)
+			size = offsetof(struct ondisk_cache_entry_extended, name);
+		else
+			size = offsetof(struct ondisk_cache_entry, name);
+		size += prefix_size + (ce_namelen(ce) - common + 1);
+
+		ondisk = xcalloc(1, size);
+		name = copy_cache_entry_to_ondisk(ondisk, ce);
+		memcpy(name, to_remove_vi, prefix_size);
+		memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - common);
+
+		strbuf_splice(previous_name, common, to_remove,
+			      ce->name + common, ce_namelen(ce) - common);
+	}
+
+	result = ce_write(c, fd, ondisk, size);
+	free(ondisk);
+	return result;
+}
+
+static int write_index_v2(struct index_state *istate, int newfd)
+{
+	git_SHA_CTX c;
+	struct cache_version_header hdr;
+	struct cache_header hdr_v2;
+	int i, err, removed, extended, hdr_version;
+	struct cache_entry **cache = istate->cache;
+	int entries = istate->cache_nr;
+	struct stat st;
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+
+	for (i = removed = extended = 0; i < entries; i++) {
+		if (cache[i]->ce_flags & CE_REMOVE)
+			removed++;
+
+		/* reduce extended entries if possible */
+		cache[i]->ce_flags &= ~CE_EXTENDED;
+		if (cache[i]->ce_flags & CE_EXTENDED_FLAGS) {
+			extended++;
+			cache[i]->ce_flags |= CE_EXTENDED;
+		}
+	}
+
+	if (!istate->version)
+		istate->version = INDEX_FORMAT_DEFAULT;
+
+	/* demote version 3 to version 2 when the latter suffices */
+	if (istate->version == 3 || istate->version == 2)
+		istate->version = extended ? 3 : 2;
+
+	hdr_version = istate->version;
+
+	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
+	hdr.hdr_version = htonl(hdr_version);
+	hdr_v2.hdr_entries = htonl(entries - removed);
+
+	git_SHA1_Init(&c);
+	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
+		return -1;
+	if (ce_write(&c, newfd, &hdr_v2, sizeof(hdr_v2)) < 0)
+		return -1;
+
+	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
+	for (i = 0; i < entries; i++) {
+		struct cache_entry *ce = cache[i];
+		if (ce->ce_flags & CE_REMOVE)
+			continue;
+		if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
+			ce_smudge_racily_clean_entry(istate, ce);
+		if (is_null_sha1(ce->sha1))
+			return error("cache entry has null sha1: %s", ce->name);
+		if (ce_write_entry(&c, newfd, ce, previous_name) < 0)
+			return -1;
+	}
+	strbuf_release(&previous_name_buf);
+
+	/* Write extension data here */
+	if (istate->cache_tree) {
+		struct strbuf sb = STRBUF_INIT;
+
+		cache_tree_write(&sb, istate->cache_tree);
+		err = write_index_ext_header(&c, newfd, CACHE_EXT_TREE, sb.len) < 0
+			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		strbuf_release(&sb);
+		if (err)
+			return -1;
+	}
+	if (istate->resolve_undo) {
+		struct strbuf sb = STRBUF_INIT;
+
+		resolve_undo_write(&sb, istate->resolve_undo);
+		err = write_index_ext_header(&c, newfd, CACHE_EXT_RESOLVE_UNDO,
+					     sb.len) < 0
+			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		strbuf_release(&sb);
+		if (err)
+			return -1;
+	}
+
+	if (ce_flush(&c, newfd) || fstat(newfd, &st))
+		return -1;
+	istate->timestamp.sec = (unsigned int)st.st_mtime;
+	istate->timestamp.nsec = ST_MTIME_NSEC(st);
+	return 0;
+}
+
+struct index_ops v2_ops = {
+	match_stat_basic,
+	verify_hdr,
+	read_index_v2,
+	write_index_v2
+};
diff --git a/read-cache.c b/read-cache.c
index 93947bf..1e7ffc2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -5,6 +5,7 @@
  */
 #define NO_THE_INDEX_COMPATIBILITY_MACROS
 #include "cache.h"
+#include "read-cache.h"
 #include "cache-tree.h"
 #include "refs.h"
 #include "dir.h"
@@ -17,26 +18,9 @@
 
 static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int really);
 
-/* Mask for the name length in ce_flags in the on-disk index */
-
-#define CE_NAMEMASK  (0x0fff)
-
-/* Index extensions.
- *
- * The first letter should be 'A'..'Z' for extensions that are not
- * necessary for a correct operation (i.e. optimization data).
- * When new extensions are added that _needs_ to be understood in
- * order to correctly interpret the index file, pick character that
- * is outside the range, to cause the reader to abort.
- */
-
-#define CACHE_EXT(s) ( (s[0]<<24)|(s[1]<<16)|(s[2]<<8)|(s[3]) )
-#define CACHE_EXT_TREE 0x54524545	/* "TREE" */
-#define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
-
 struct index_state the_index;
 
-static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
@@ -190,7 +174,7 @@ static int ce_compare_gitlink(const struct cache_entry *ce)
 	return hashcmp(sha1, ce->sha1);
 }
 
-static int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st)
+int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st)
 {
 	switch (st->st_mode & S_IFMT) {
 	case S_IFREG:
@@ -210,7 +194,21 @@ static int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st)
 	return 0;
 }
 
-static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
+/*
+ * If the read/write operations are not set yet, set them
+ * according to the index version
+ */
+static void set_istate_ops(struct index_state *istate)
+{
+	if (!istate->version)
+		istate->version = INDEX_FORMAT_DEFAULT;
+
+	if (istate->version >= 2 && istate->version <= 4)
+		istate->ops = &v2_ops;
+}
+
+int ce_match_stat_basic(struct index_state *istate,
+			const struct cache_entry *ce, struct stat *st)
 {
 	unsigned int changed = 0;
 
@@ -243,19 +241,14 @@ static int ce_match_stat_basic(const struct cache_entry *ce, struct stat *st)
 		die("internal error: ce_mode is %o", ce->ce_mode);
 	}
 
-	changed |= match_stat_data(&ce->ce_stat_data, st);
-
-	/* Racily smudged entry? */
-	if (!ce->ce_stat_data.sd_size) {
-		if (!is_empty_blob_sha1(ce->sha1))
-			changed |= DATA_CHANGED;
-	}
-
+	set_istate_ops(istate);
+	changed = istate->ops->match_stat_basic(ce, st, changed);
 	return changed;
 }
 
-static int is_racy_timestamp(const struct index_state *istate,
-			     const struct cache_entry *ce)
+
+int is_racy_timestamp(const struct index_state *istate,
+		      const struct cache_entry *ce)
 {
 	return (!S_ISGITLINK(ce->ce_mode) &&
 		istate->timestamp.sec &&
@@ -270,9 +263,8 @@ static int is_racy_timestamp(const struct index_state *istate,
 		 );
 }
 
-int ie_match_stat(const struct index_state *istate,
-		  const struct cache_entry *ce, struct stat *st,
-		  unsigned int options)
+int ie_match_stat(struct index_state *istate, const struct cache_entry *ce,
+		  struct stat *st, unsigned int options)
 {
 	unsigned int changed;
 	int ignore_valid = options & CE_MATCH_IGNORE_VALID;
@@ -298,7 +290,7 @@ int ie_match_stat(const struct index_state *istate,
 	if (ce->ce_flags & CE_INTENT_TO_ADD)
 		return DATA_CHANGED | TYPE_CHANGED | MODE_CHANGED;
 
-	changed = ce_match_stat_basic(ce, st);
+	changed = ce_match_stat_basic(istate, ce, st);
 
 	/*
 	 * Within 1 second of this sequence:
@@ -326,8 +318,7 @@ int ie_match_stat(const struct index_state *istate,
 	return changed;
 }
 
-int ie_modified(const struct index_state *istate,
-		const struct cache_entry *ce,
+int ie_modified(struct index_state *istate, const struct cache_entry *ce,
 		struct stat *st, unsigned int options)
 {
 	int changed, changed_fs;
@@ -1211,13 +1202,10 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int reall
 	return refresh_cache_ent(&the_index, ce, really, NULL, NULL);
 }
 
-
 /*****************************************************************
  * Index File I/O
  *****************************************************************/
 
-#define INDEX_FORMAT_DEFAULT 3
-
 /*
  * dev/ino/uid/gid/size are also just tracked to the low 32 bits
  * Again - this is just a (very strong in practice) heuristic that
@@ -1268,7 +1256,8 @@ struct ondisk_cache_entry_extended {
 			    ondisk_cache_entry_extended_size(ce_namelen(ce)) : \
 			    ondisk_cache_entry_size(ce_namelen(ce)))
 
-static int verify_hdr_version(struct cache_version_header *hdr, unsigned long size)
+static int verify_hdr_version(struct index_state *istate,
+			      struct cache_version_header *hdr, unsigned long size)
 {
 	int hdr_version;
 
@@ -1277,43 +1266,7 @@ static int verify_hdr_version(struct cache_version_header *hdr, unsigned long si
 	hdr_version = ntohl(hdr->hdr_version);
 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
 		return error("bad index version %d", hdr_version);
-	return 0;
-}
-
-static int verify_hdr(void *mmap, unsigned long size)
-{
-	git_SHA_CTX c;
-	unsigned char sha1[20];
-
-	if (size < sizeof(struct cache_version_header)
-	    + sizeof(struct cache_header) + 20)
-		die("index file smaller than expected");
-
-	git_SHA1_Init(&c);
-	git_SHA1_Update(&c, mmap, size - 20);
-	git_SHA1_Final(sha1, &c);
-	if (hashcmp(sha1, (unsigned char *)mmap + size - 20))
-		return error("bad index file sha1 signature");
-	return 0;
-}
-
-static int read_index_extension(struct index_state *istate,
-				const char *ext, void *data, unsigned long sz)
-{
-	switch (CACHE_EXT(ext)) {
-	case CACHE_EXT_TREE:
-		istate->cache_tree = cache_tree_read(data, sz);
-		break;
-	case CACHE_EXT_RESOLVE_UNDO:
-		istate->resolve_undo = resolve_undo_read(data, sz);
-		break;
-	default:
-		if (*ext < 'A' || 'Z' < *ext)
-			return error("index uses %.4s extension, which we do not understand",
-				     ext);
-		fprintf(stderr, "ignoring %.4s extension\n", ext);
-		break;
-	}
+	istate->ops = &v2_ops;
 	return 0;
 }
 
@@ -1322,178 +1275,6 @@ int read_index(struct index_state *istate)
 	return read_index_from(istate, get_index_file());
 }
 
-#ifndef NEEDS_ALIGNED_ACCESS
-#define ntoh_s(var) ntohs(var)
-#define ntoh_l(var) ntohl(var)
-#else
-static inline uint16_t ntoh_s_force_align(void *p)
-{
-	uint16_t x;
-	memcpy(&x, p, sizeof(x));
-	return ntohs(x);
-}
-static inline uint32_t ntoh_l_force_align(void *p)
-{
-	uint32_t x;
-	memcpy(&x, p, sizeof(x));
-	return ntohl(x);
-}
-#define ntoh_s(var) ntoh_s_force_align(&(var))
-#define ntoh_l(var) ntoh_l_force_align(&(var))
-#endif
-
-static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
-						   unsigned int flags,
-						   const char *name,
-						   size_t len)
-{
-	struct cache_entry *ce = xmalloc(cache_entry_size(len));
-
-	ce->ce_stat_data.sd_ctime.sec = ntoh_l(ondisk->ctime.sec);
-	ce->ce_stat_data.sd_mtime.sec = ntoh_l(ondisk->mtime.sec);
-	ce->ce_stat_data.sd_ctime.nsec = ntoh_l(ondisk->ctime.nsec);
-	ce->ce_stat_data.sd_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
-	ce->ce_stat_data.sd_dev   = ntoh_l(ondisk->dev);
-	ce->ce_stat_data.sd_ino   = ntoh_l(ondisk->ino);
-	ce->ce_mode  = ntoh_l(ondisk->mode);
-	ce->ce_stat_data.sd_uid   = ntoh_l(ondisk->uid);
-	ce->ce_stat_data.sd_gid   = ntoh_l(ondisk->gid);
-	ce->ce_stat_data.sd_size  = ntoh_l(ondisk->size);
-	ce->ce_flags = flags & ~CE_NAMEMASK;
-	ce->ce_namelen = len;
-	hashcpy(ce->sha1, ondisk->sha1);
-	memcpy(ce->name, name, len);
-	ce->name[len] = '\0';
-	return ce;
-}
-
-/*
- * Adjacent cache entries tend to share the leading paths, so it makes
- * sense to only store the differences in later entries.  In the v4
- * on-disk format of the index, each on-disk cache entry stores the
- * number of bytes to be stripped from the end of the previous name,
- * and the bytes to append to the result, to come up with its name.
- */
-static unsigned long expand_name_field(struct strbuf *name, const char *cp_)
-{
-	const unsigned char *ep, *cp = (const unsigned char *)cp_;
-	size_t len = decode_varint(&cp);
-
-	if (name->len < len)
-		die("malformed name field in the index");
-	strbuf_remove(name, name->len - len, len);
-	for (ep = cp; *ep; ep++)
-		; /* find the end */
-	strbuf_add(name, cp, ep - cp);
-	return (const char *)ep + 1 - cp_;
-}
-
-static struct cache_entry *create_from_disk(struct ondisk_cache_entry *ondisk,
-					    unsigned long *ent_size,
-					    struct strbuf *previous_name)
-{
-	struct cache_entry *ce;
-	size_t len;
-	const char *name;
-	unsigned int flags;
-
-	/* On-disk flags are just 16 bits */
-	flags = ntoh_s(ondisk->flags);
-	len = flags & CE_NAMEMASK;
-
-	if (flags & CE_EXTENDED) {
-		struct ondisk_cache_entry_extended *ondisk2;
-		int extended_flags;
-		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
-		extended_flags = ntoh_s(ondisk2->flags2) << 16;
-		/* We do not yet understand any bit out of CE_EXTENDED_FLAGS */
-		if (extended_flags & ~CE_EXTENDED_FLAGS)
-			die("Unknown index entry format %08x", extended_flags);
-		flags |= extended_flags;
-		name = ondisk2->name;
-	}
-	else
-		name = ondisk->name;
-
-	if (!previous_name) {
-		/* v3 and earlier */
-		if (len == CE_NAMEMASK)
-			len = strlen(name);
-		ce = cache_entry_from_ondisk(ondisk, flags, name, len);
-
-		*ent_size = ondisk_ce_size(ce);
-	} else {
-		unsigned long consumed;
-		consumed = expand_name_field(previous_name, name);
-		ce = cache_entry_from_ondisk(ondisk, flags,
-					     previous_name->buf,
-					     previous_name->len);
-
-		*ent_size = (name - ((char *)ondisk)) + consumed;
-	}
-	return ce;
-}
-
-static int read_index_v2(struct index_state *istate, void *mmap, unsigned long mmap_size)
-{
-	int i;
-	unsigned long src_offset;
-	struct cache_version_header *hdr;
-	struct cache_header *hdr_v2;
-	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
-
-	hdr = mmap;
-	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
-
-	istate->version = ntohl(hdr->hdr_version);
-	istate->cache_nr = ntohl(hdr_v2->hdr_entries);
-	istate->cache_alloc = alloc_nr(istate->cache_nr);
-	istate->cache = xcalloc(istate->cache_alloc, sizeof(*istate->cache));
-	istate->initialized = 1;
-
-	if (istate->version == 4)
-		previous_name = &previous_name_buf;
-	else
-		previous_name = NULL;
-
-	src_offset = sizeof(*hdr) + sizeof(*hdr_v2);
-	for (i = 0; i < istate->cache_nr; i++) {
-		struct ondisk_cache_entry *disk_ce;
-		struct cache_entry *ce;
-		unsigned long consumed;
-
-		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
-		ce = create_from_disk(disk_ce, &consumed, previous_name);
-		set_index_entry(istate, i, ce);
-
-		src_offset += consumed;
-	}
-	strbuf_release(&previous_name_buf);
-
-	while (src_offset <= mmap_size - 20 - 8) {
-		/* After an array of active_nr index entries,
-		 * there can be arbitrary number of extended
-		 * sections, each of which is prefixed with
-		 * extension name (4-byte) and section length
-		 * in 4-byte network byte order.
-		 */
-		uint32_t extsize;
-		memcpy(&extsize, (char *)mmap + src_offset + 4, 4);
-		extsize = ntohl(extsize);
-		if (read_index_extension(istate,
-					 (const char *) mmap + src_offset,
-					 (char *) mmap + src_offset + 8,
-					 extsize) < 0)
-			goto unmap;
-		src_offset += 8;
-		src_offset += extsize;
-	}
-	return 0;
-unmap:
-	munmap(mmap, mmap_size);
-	die("index file corrupt");
-}
-
 /* remember to discard_cache() before reading a different cache! */
 int read_index_from(struct index_state *istate, const char *path)
 {
@@ -1510,6 +1291,7 @@ int read_index_from(struct index_state *istate, const char *path)
 	errno = ENOENT;
 	istate->timestamp.sec = 0;
 	istate->timestamp.nsec = 0;
+
 	fd = open(path, O_RDONLY);
 	if (fd < 0) {
 		if (errno == ENOENT)
@@ -1522,24 +1304,23 @@ int read_index_from(struct index_state *istate, const char *path)
 
 	errno = EINVAL;
 	mmap_size = xsize_t(st.st_size);
-	if (mmap_size < sizeof(struct cache_header) + 20)
-		die("index file smaller than expected");
-
 	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
 	close(fd);
 	if (mmap == MAP_FAILED)
 		die_errno("unable to map index file");
 
 	hdr = mmap;
-	if (verify_hdr_version(hdr, mmap_size) < 0)
+	if (verify_hdr_version(istate, hdr, mmap_size) < 0)
 		goto unmap;
 
-	if (verify_hdr(mmap, mmap_size) < 0)
+	if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
 		goto unmap;
 
-	read_index_v2(istate, mmap, mmap_size);
+	if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
+		goto unmap;
 	istate->timestamp.sec = st.st_mtime;
 	istate->timestamp.nsec = ST_MTIME_NSEC(st);
+
 	munmap(mmap, mmap_size);
 	return istate->cache_nr;
 
@@ -1583,201 +1364,6 @@ int unmerged_index(const struct index_state *istate)
 	return 0;
 }
 
-#define WRITE_BUFFER_SIZE 8192
-static unsigned char write_buffer[WRITE_BUFFER_SIZE];
-static unsigned long write_buffer_len;
-
-static int ce_write_flush(git_SHA_CTX *context, int fd)
-{
-	unsigned int buffered = write_buffer_len;
-	if (buffered) {
-		git_SHA1_Update(context, write_buffer, buffered);
-		if (write_in_full(fd, write_buffer, buffered) != buffered)
-			return -1;
-		write_buffer_len = 0;
-	}
-	return 0;
-}
-
-static int ce_write(git_SHA_CTX *context, int fd, void *data, unsigned int len)
-{
-	while (len) {
-		unsigned int buffered = write_buffer_len;
-		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
-		if (partial > len)
-			partial = len;
-		memcpy(write_buffer + buffered, data, partial);
-		buffered += partial;
-		if (buffered == WRITE_BUFFER_SIZE) {
-			write_buffer_len = buffered;
-			if (ce_write_flush(context, fd))
-				return -1;
-			buffered = 0;
-		}
-		write_buffer_len = buffered;
-		len -= partial;
-		data = (char *) data + partial;
-	}
-	return 0;
-}
-
-static int write_index_ext_header(git_SHA_CTX *context, int fd,
-				  unsigned int ext, unsigned int sz)
-{
-	ext = htonl(ext);
-	sz = htonl(sz);
-	return ((ce_write(context, fd, &ext, 4) < 0) ||
-		(ce_write(context, fd, &sz, 4) < 0)) ? -1 : 0;
-}
-
-static int ce_flush(git_SHA_CTX *context, int fd)
-{
-	unsigned int left = write_buffer_len;
-
-	if (left) {
-		write_buffer_len = 0;
-		git_SHA1_Update(context, write_buffer, left);
-	}
-
-	/* Flush first if not enough space for SHA1 signature */
-	if (left + 20 > WRITE_BUFFER_SIZE) {
-		if (write_in_full(fd, write_buffer, left) != left)
-			return -1;
-		left = 0;
-	}
-
-	/* Append the SHA1 signature at the end */
-	git_SHA1_Final(write_buffer + left, context);
-	left += 20;
-	return (write_in_full(fd, write_buffer, left) != left) ? -1 : 0;
-}
-
-static void ce_smudge_racily_clean_entry(struct cache_entry *ce)
-{
-	/*
-	 * The only thing we care about in this function is to smudge the
-	 * falsely clean entry due to touch-update-touch race, so we leave
-	 * everything else as they are.  We are called for entries whose
-	 * ce_stat_data.sd_mtime match the index file mtime.
-	 *
-	 * Note that this actually does not do much for gitlinks, for
-	 * which ce_match_stat_basic() always goes to the actual
-	 * contents.  The caller checks with is_racy_timestamp() which
-	 * always says "no" for gitlinks, so we are not called for them ;-)
-	 */
-	struct stat st;
-
-	if (lstat(ce->name, &st) < 0)
-		return;
-	if (ce_match_stat_basic(ce, &st))
-		return;
-	if (ce_modified_check_fs(ce, &st)) {
-		/* This is "racily clean"; smudge it.  Note that this
-		 * is a tricky code.  At first glance, it may appear
-		 * that it can break with this sequence:
-		 *
-		 * $ echo xyzzy >frotz
-		 * $ git-update-index --add frotz
-		 * $ : >frotz
-		 * $ sleep 3
-		 * $ echo filfre >nitfol
-		 * $ git-update-index --add nitfol
-		 *
-		 * but it does not.  When the second update-index runs,
-		 * it notices that the entry "frotz" has the same timestamp
-		 * as index, and if we were to smudge it by resetting its
-		 * size to zero here, then the object name recorded
-		 * in index is the 6-byte file but the cached stat information
-		 * becomes zero --- which would then match what we would
-		 * obtain from the filesystem next time we stat("frotz").
-		 *
-		 * However, the second update-index, before calling
-		 * this function, notices that the cached size is 6
-		 * bytes and what is on the filesystem is an empty
-		 * file, and never calls us, so the cached size information
-		 * for "frotz" stays 6 which does not match the filesystem.
-		 */
-		ce->ce_stat_data.sd_size = 0;
-	}
-}
-
-/* Copy miscellaneous fields but not the name */
-static char *copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
-				       struct cache_entry *ce)
-{
-	short flags;
-
-	ondisk->ctime.sec = htonl(ce->ce_stat_data.sd_ctime.sec);
-	ondisk->mtime.sec = htonl(ce->ce_stat_data.sd_mtime.sec);
-	ondisk->ctime.nsec = htonl(ce->ce_stat_data.sd_ctime.nsec);
-	ondisk->mtime.nsec = htonl(ce->ce_stat_data.sd_mtime.nsec);
-	ondisk->dev  = htonl(ce->ce_stat_data.sd_dev);
-	ondisk->ino  = htonl(ce->ce_stat_data.sd_ino);
-	ondisk->mode = htonl(ce->ce_mode);
-	ondisk->uid  = htonl(ce->ce_stat_data.sd_uid);
-	ondisk->gid  = htonl(ce->ce_stat_data.sd_gid);
-	ondisk->size = htonl(ce->ce_stat_data.sd_size);
-	hashcpy(ondisk->sha1, ce->sha1);
-
-	flags = ce->ce_flags;
-	flags |= (ce_namelen(ce) >= CE_NAMEMASK ? CE_NAMEMASK : ce_namelen(ce));
-	ondisk->flags = htons(flags);
-	if (ce->ce_flags & CE_EXTENDED) {
-		struct ondisk_cache_entry_extended *ondisk2;
-		ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
-		ondisk2->flags2 = htons((ce->ce_flags & CE_EXTENDED_FLAGS) >> 16);
-		return ondisk2->name;
-	}
-	else {
-		return ondisk->name;
-	}
-}
-
-static int ce_write_entry(git_SHA_CTX *c, int fd, struct cache_entry *ce,
-			  struct strbuf *previous_name)
-{
-	int size;
-	struct ondisk_cache_entry *ondisk;
-	char *name;
-	int result;
-
-	if (!previous_name) {
-		size = ondisk_ce_size(ce);
-		ondisk = xcalloc(1, size);
-		name = copy_cache_entry_to_ondisk(ondisk, ce);
-		memcpy(name, ce->name, ce_namelen(ce));
-	} else {
-		int common, to_remove, prefix_size;
-		unsigned char to_remove_vi[16];
-		for (common = 0;
-		     (ce->name[common] &&
-		      common < previous_name->len &&
-		      ce->name[common] == previous_name->buf[common]);
-		     common++)
-			; /* still matching */
-		to_remove = previous_name->len - common;
-		prefix_size = encode_varint(to_remove, to_remove_vi);
-
-		if (ce->ce_flags & CE_EXTENDED)
-			size = offsetof(struct ondisk_cache_entry_extended, name);
-		else
-			size = offsetof(struct ondisk_cache_entry, name);
-		size += prefix_size + (ce_namelen(ce) - common + 1);
-
-		ondisk = xcalloc(1, size);
-		name = copy_cache_entry_to_ondisk(ondisk, ce);
-		memcpy(name, to_remove_vi, prefix_size);
-		memcpy(name + prefix_size, ce->name + common, ce_namelen(ce) - common);
-
-		strbuf_splice(previous_name, common, to_remove,
-			      ce->name + common, ce_namelen(ce) - common);
-	}
-
-	result = ce_write(c, fd, ondisk, size);
-	free(ondisk);
-	return result;
-}
-
 static int has_racy_timestamp(struct index_state *istate)
 {
 	int entries = istate->cache_nr;
@@ -1803,95 +1389,10 @@ void update_index_if_able(struct index_state *istate, struct lock_file *lockfile
 		rollback_lock_file(lockfile);
 }
 
-static int write_index_v2(struct index_state *istate, int newfd)
-{
-	git_SHA_CTX c;
-	struct cache_version_header hdr;
-	struct cache_header hdr_v2;
-	int i, err, removed, extended, hdr_version;
-	struct cache_entry **cache = istate->cache;
-	int entries = istate->cache_nr;
-	struct stat st;
-	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
-
-	for (i = removed = extended = 0; i < entries; i++) {
-		if (cache[i]->ce_flags & CE_REMOVE)
-			removed++;
-
-		/* reduce extended entries if possible */
-		cache[i]->ce_flags &= ~CE_EXTENDED;
-		if (cache[i]->ce_flags & CE_EXTENDED_FLAGS) {
-			extended++;
-			cache[i]->ce_flags |= CE_EXTENDED;
-		}
-	}
-
-	if (!istate->version)
-		istate->version = INDEX_FORMAT_DEFAULT;
-
-	/* demote version 3 to version 2 when the latter suffices */
-	if (istate->version == 3 || istate->version == 2)
-		istate->version = extended ? 3 : 2;
-
-	hdr_version = istate->version;
-
-	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
-	hdr.hdr_version = htonl(hdr_version);
-	hdr_v2.hdr_entries = htonl(entries - removed);
-
-	git_SHA1_Init(&c);
-	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
-		return -1;
-	if (ce_write(&c, newfd, &hdr_v2, sizeof(hdr_v2)) < 0)
-		return -1;
-
-	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
-	for (i = 0; i < entries; i++) {
-		struct cache_entry *ce = cache[i];
-		if (ce->ce_flags & CE_REMOVE)
-			continue;
-		if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
-			ce_smudge_racily_clean_entry(ce);
-		if (is_null_sha1(ce->sha1))
-			return error("cache entry has null sha1: %s", ce->name);
-		if (ce_write_entry(&c, newfd, ce, previous_name) < 0)
-			return -1;
-	}
-	strbuf_release(&previous_name_buf);
-
-	/* Write extension data here */
-	if (istate->cache_tree) {
-		struct strbuf sb = STRBUF_INIT;
-
-		cache_tree_write(&sb, istate->cache_tree);
-		err = write_index_ext_header(&c, newfd, CACHE_EXT_TREE, sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
-		strbuf_release(&sb);
-		if (err)
-			return -1;
-	}
-	if (istate->resolve_undo) {
-		struct strbuf sb = STRBUF_INIT;
-
-		resolve_undo_write(&sb, istate->resolve_undo);
-		err = write_index_ext_header(&c, newfd, CACHE_EXT_RESOLVE_UNDO,
-					     sb.len) < 0
-			|| ce_write(&c, newfd, sb.buf, sb.len) < 0;
-		strbuf_release(&sb);
-		if (err)
-			return -1;
-	}
-
-	if (ce_flush(&c, newfd) || fstat(newfd, &st))
-		return -1;
-	istate->timestamp.sec = (unsigned int)st.st_mtime;
-	istate->timestamp.nsec = ST_MTIME_NSEC(st);
-	return 0;
-}
-
 int write_index(struct index_state *istate, int newfd)
 {
-	return write_index_v2(istate, newfd);
+	set_istate_ops(istate);
+	return istate->ops->write_index(istate, newfd);
 }
 
 /*
diff --git a/read-cache.h b/read-cache.h
new file mode 100644
index 0000000..f31a38b
--- /dev/null
+++ b/read-cache.h
@@ -0,0 +1,57 @@
+/* Index extensions.
+ *
+ * The first letter should be 'A'..'Z' for extensions that are not
+ * necessary for a correct operation (i.e. optimization data).
+ * When new extensions are added that _need_ to be understood in
+ * order to correctly interpret the index file, pick a character that
+ * is outside the range, to cause the reader to abort.
+ */
+
+#define CACHE_EXT(s) ( (s[0]<<24)|(s[1]<<16)|(s[2]<<8)|(s[3]) )
+#define CACHE_EXT_TREE 0x54524545	/* "TREE" */
+#define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
+
+#define INDEX_FORMAT_DEFAULT 3
+
+/*
+ * Basic data structures for the directory cache
+ */
+struct cache_version_header {
+	unsigned int hdr_signature;
+	unsigned int hdr_version;
+};
+
+struct index_ops {
+	int (*match_stat_basic)(const struct cache_entry *ce, struct stat *st, int changed);
+	int (*verify_hdr)(void *mmap, unsigned long size);
+	int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size);
+	int (*write_index)(struct index_state *istate, int newfd);
+};
+
+extern struct index_ops v2_ops;
+
+#ifndef NEEDS_ALIGNED_ACCESS
+#define ntoh_s(var) ntohs(var)
+#define ntoh_l(var) ntohl(var)
+#else
+static inline uint16_t ntoh_s_force_align(void *p)
+{
+	uint16_t x;
+	memcpy(&x, p, sizeof(x));
+	return ntohs(x);
+}
+static inline uint32_t ntoh_l_force_align(void *p)
+{
+	uint32_t x;
+	memcpy(&x, p, sizeof(x));
+	return ntohl(x);
+}
+#define ntoh_s(var) ntoh_s_force_align(&(var))
+#define ntoh_l(var) ntoh_l_force_align(&(var))
+#endif
+
+extern int ce_modified_check_fs(const struct cache_entry *ce, struct stat *st);
+extern int ce_match_stat_basic(struct index_state *istate, const struct cache_entry *ce,
+			       struct stat *st);
+extern int is_racy_timestamp(const struct index_state *istate, const struct cache_entry *ce);
+extern void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce);
diff --git a/test-index-version.c b/test-index-version.c
index 4c0386f..65545a7 100644
--- a/test-index-version.c
+++ b/test-index-version.c
@@ -1,5 +1,10 @@
 #include "cache.h"
 
+struct cache_version_header {
+	unsigned int hdr_signature;
+	unsigned int hdr_version;
+};
+
 int main(int argc, char **argv)
 {
 	struct cache_version_header hdr;
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 04/22] read-cache: Re-read index if index file changed
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (2 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 03/22] read-cache: move index v2 specific functions to their own file Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Add the possibility of re-reading the index file if it changed
while it was being read.

The index file might change during the read, causing outdated
information to be displayed.  We check whether the index file
changed by using its stat data as a heuristic.

Helped-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache.c | 91 +++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 34 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 1e7ffc2..3e3a0e2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1275,11 +1275,31 @@ int read_index(struct index_state *istate)
 	return read_index_from(istate, get_index_file());
 }
 
+static int index_changed(struct stat *st_old, struct stat *st_new)
+{
+	if (st_old->st_mtime != st_new->st_mtime ||
+#if !defined (__CYGWIN__)
+	    st_old->st_uid   != st_new->st_uid ||
+	    st_old->st_gid   != st_new->st_gid ||
+	    st_old->st_ino   != st_new->st_ino ||
+#endif
+#if USE_NSEC
+	    ST_MTIME_NSEC(*st_old) != ST_MTIME_NSEC(*st_new) ||
+#endif
+#if USE_STDEV
+	    st_old->st_dev != st_new->st_dev ||
+#endif
+	    st_old->st_size != st_new->st_size)
+		return 1;
+
+	return 0;
+}
+
 /* remember to discard_cache() before reading a different cache! */
 int read_index_from(struct index_state *istate, const char *path)
 {
-	int fd;
-	struct stat st;
+	int fd, err, i;
+	struct stat st_old, st_new;
 	struct cache_version_header *hdr;
 	void *mmap;
 	size_t mmap_size;
@@ -1291,41 +1311,44 @@ int read_index_from(struct index_state *istate, const char *path)
 	errno = ENOENT;
 	istate->timestamp.sec = 0;
 	istate->timestamp.nsec = 0;
+	for (i = 0; i < 50; i++) {
+		err = 0;
+		fd = open(path, O_RDONLY);
+		if (fd < 0) {
+			if (errno == ENOENT)
+				return 0;
+			die_errno("index file open failed");
+		}
 
-	fd = open(path, O_RDONLY);
-	if (fd < 0) {
-		if (errno == ENOENT)
-			return 0;
-		die_errno("index file open failed");
+		if (fstat(fd, &st_old))
+			die_errno("cannot stat the open index");
+
+		errno = EINVAL;
+		mmap_size = xsize_t(st_old.st_size);
+		mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+		close(fd);
+		if (mmap == MAP_FAILED)
+			die_errno("unable to map index file");
+
+		hdr = mmap;
+		if (verify_hdr_version(istate, hdr, mmap_size) < 0)
+			err = 1;
+
+		if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
+			err = 1;
+
+		if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
+			err = 1;
+		istate->timestamp.sec = st_old.st_mtime;
+		istate->timestamp.nsec = ST_MTIME_NSEC(st_old);
+		if (lstat(path, &st_new))
+			die_errno("cannot stat the open index");
+
+		munmap(mmap, mmap_size);
+		if (!index_changed(&st_old, &st_new) && !err)
+			return istate->cache_nr;
 	}
 
-	if (fstat(fd, &st))
-		die_errno("cannot stat the open index");
-
-	errno = EINVAL;
-	mmap_size = xsize_t(st.st_size);
-	mmap = xmmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
-	close(fd);
-	if (mmap == MAP_FAILED)
-		die_errno("unable to map index file");
-
-	hdr = mmap;
-	if (verify_hdr_version(istate, hdr, mmap_size) < 0)
-		goto unmap;
-
-	if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
-		goto unmap;
-
-	if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
-		goto unmap;
-	istate->timestamp.sec = st.st_mtime;
-	istate->timestamp.nsec = ST_MTIME_NSEC(st);
-
-	munmap(mmap, mmap_size);
-	return istate->cache_nr;
-
-unmap:
-	munmap(mmap, mmap_size);
 	die("index file corrupt");
 }
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 05/22] read-cache: add index reading api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (3 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 04/22] read-cache: Re-read index if index file changed Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-08  2:01   ` Duy Nguyen
                     ` (2 more replies)
  2013-07-07  8:11 ` [PATCH 06/22] make sure partially read index is not changed Thomas Gummerer
                   ` (16 subsequent siblings)
  21 siblings, 3 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Add an api for accessing the index file.  Currently there is only a
very basic api, which allows only a full read of the index and leaves
filtering of the data to the callers.  The new index api lets callers
use only part of the index and provides functions for iterating over
and accessing cache entries.

This simplifies future improvements to the in-memory format, as
changes will be concentrated in one file instead of being spread
across the whole git source code.
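The centerpiece of the api is the callback-driven iterator that
replaces open-coded loops over active_cache.  A rough standalone
illustration of that pattern (the types and names here are
hypothetical stand-ins, not git's actual structures):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for git's cache_entry / each_cache_entry_fn. */
struct entry {
	const char *name;
};

typedef int (*each_entry_fn)(struct entry *e, void *cb_data);

/*
 * Iterate over all entries, stopping early when a callback returns
 * non-zero -- the same contract as for_each_index_entry().
 */
int for_each_entry(struct entry *entries, size_t nr,
		   each_entry_fn fn, void *cb_data)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < nr; i++)
		if ((ret = fn(&entries[i], cb_data)))
			break;
	return ret;
}

/* Example callback: count entries under "dir/", passing state via cb_data. */
int count_dir_entries(struct entry *e, void *cb_data)
{
	int *count = cb_data;

	if (!strncmp(e->name, "dir/", 4))
		(*count)++;
	return 0;	/* 0 keeps the iteration going */
}
```

Because callers only see the callback interface, the backing storage
(a flat array today, something else for a future in-memory format) can
change without touching them.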

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h         |  57 +++++++++++++++++++++++++++++-
 read-cache-v2.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++--
 read-cache.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 read-cache.h    |  12 ++++++-
 4 files changed, 263 insertions(+), 10 deletions(-)

diff --git a/cache.h b/cache.h
index 5082b34..d38dfbd 100644
--- a/cache.h
+++ b/cache.h
@@ -127,7 +127,8 @@ struct cache_entry {
 	unsigned int ce_flags;
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
-	struct cache_entry *next;
+	struct cache_entry *next; /* used by name_hash */
+	struct cache_entry *next_ce; /* used to keep a list of cache entries */
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -258,6 +259,32 @@ static inline unsigned int canon_mode(unsigned int mode)
 
 #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
 
+/*
+ * Options by which the index should be filtered when read partially.
+ *
+ * pathspec: The pathspec which the index entries have to match
+ * seen: Used to return the seen parameter from match_pathspec()
+ * max_prefix, max_prefix_len: These are set to the longest common
+ *     prefix of the given pathspec, and to the length of that
+ *     prefix, respectively
+ *
+ * read_staged: used to indicate if the conflicted entries (entries
+ *     with a stage) should be included
+ * read_cache_tree: used to indicate if the cache-tree should be read
+ * read_resolve_undo: used to indicate if the resolve undo data should
+ *     be read
+ */
+struct filter_opts {
+	const char **pathspec;
+	char *seen;
+	char *max_prefix;
+	int max_prefix_len;
+
+	int read_staged;
+	int read_cache_tree;
+	int read_resolve_undo;
+};
+
 struct index_state {
 	struct cache_entry **cache;
 	unsigned int version;
@@ -270,6 +297,8 @@ struct index_state {
 	struct hash_table name_hash;
 	struct hash_table dir_hash;
 	struct index_ops *ops;
+	struct internal_ops *internal_ops;
+	struct filter_opts *filter_opts;
 };
 
 extern struct index_state the_index;
@@ -311,6 +340,17 @@ extern void free_name_hash(struct index_state *istate);
 #define unmerge_cache_entry_at(at) unmerge_index_entry_at(&the_index, at)
 #define unmerge_cache(pathspec) unmerge_index(&the_index, pathspec)
 #define read_blob_data_from_cache(path, sz) read_blob_data_from_index(&the_index, (path), (sz))
+
+/* index api */
+#define read_cache_filtered(opts) read_index_filtered(&the_index, (opts))
+#define read_cache_filtered_from(path, opts) read_index_filtered_from(&the_index, (path), (opts))
+#define get_cache_entry_by_name(name, namelen, ce) \
+	get_index_entry_by_name(&the_index, (name), (namelen), (ce))
+#define for_each_cache_entry(fn, cb_data) \
+	for_each_index_entry(&the_index, (fn), (cb_data))
+#define next_cache_entry(ce) next_index_entry(ce)
+#define cache_change_filter_opts(opts) index_change_filter_opts(&the_index, (opts))
+#define sort_cache() sort_index(&the_index)
 #endif
 
 enum object_type {
@@ -438,6 +478,21 @@ extern int init_db(const char *template_dir, unsigned int flags);
 		} \
 	} while (0)
 
+/* index api */
+extern int read_index_filtered(struct index_state *, struct filter_opts *opts);
+extern int read_index_filtered_from(struct index_state *, const char *path, struct filter_opts *opts);
+extern int get_index_entry_by_name(struct index_state *, const char *name, int namelen,
+				   struct cache_entry **ce);
+extern struct cache_entry *next_index_entry(struct cache_entry *ce);
+void index_change_filter_opts(struct index_state *istate, struct filter_opts *opts);
+void sort_index(struct index_state *istate);
+
+typedef int each_cache_entry_fn(struct cache_entry *ce, void *);
+
+extern int for_each_index_entry(struct index_state *istate,
+				each_cache_entry_fn, void *);
+
+
 /* Initialize and use the cache information */
 extern int read_index(struct index_state *);
 extern int read_index_preload(struct index_state *, const char **pathspec);
diff --git a/read-cache-v2.c b/read-cache-v2.c
index a6883c3..1ed640d 100644
--- a/read-cache-v2.c
+++ b/read-cache-v2.c
@@ -3,6 +3,7 @@
 #include "resolve-undo.h"
 #include "cache-tree.h"
 #include "varint.h"
+#include "dir.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 #define CE_NAMEMASK  (0x0fff)
@@ -117,6 +118,7 @@ static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *on
 	hashcpy(ce->sha1, ondisk->sha1);
 	memcpy(ce->name, name, len);
 	ce->name[len] = '\0';
+	ce->next_ce = NULL;
 	return ce;
 }
 
@@ -207,14 +209,21 @@ static int read_index_extension(struct index_state *istate,
 	return 0;
 }
 
+/*
+ * Reading only part of the index is no faster than reading all of
+ * it, so we always read the whole index to avoid having to re-read
+ * it later.  The filter_opts determine what part of the index is
+ * used when retrieving the cache entries.
+ */
 static int read_index_v2(struct index_state *istate, void *mmap,
-			 unsigned long mmap_size)
+			 unsigned long mmap_size, struct filter_opts *opts)
 {
 	int i;
 	unsigned long src_offset;
 	struct cache_version_header *hdr;
 	struct cache_header *hdr_v2;
 	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+	struct cache_entry *prev = NULL;
 
 	hdr = mmap;
 	hdr_v2 = (struct cache_header *)((char *)mmap + sizeof(*hdr));
@@ -237,9 +246,12 @@ static int read_index_v2(struct index_state *istate, void *mmap,
 
 		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
 		ce = create_from_disk(disk_ce, &consumed, previous_name);
+		if (prev)
+			prev->next_ce = ce;
 		set_index_entry(istate, i, ce);
 
 		src_offset += consumed;
+		prev = ce;
 	}
 	strbuf_release(&previous_name_buf);
 
@@ -267,6 +279,16 @@ unmap:
 	die("index file corrupt");
 }
 
+static void index_change_filter_opts_v2(struct index_state *istate, struct filter_opts *opts)
+{
+	/*
+	 * We don't need to re-read anything, because in index v2 we
+	 * read the whole index up front.  Just change the options by
+	 * which the index is filtered when accessing it.
+	 */
+	istate->filter_opts = opts;
+}
+
 #define WRITE_BUFFER_SIZE 8192
 static unsigned char write_buffer[WRITE_BUFFER_SIZE];
 static unsigned long write_buffer_len;
@@ -548,9 +570,79 @@ static int write_index_v2(struct index_state *istate, int newfd)
 	return 0;
 }
 
+int for_each_index_entry_v2(struct index_state *istate, each_cache_entry_fn fn, void *cb_data)
+{
+	int i, ret = 0;
+	struct filter_opts *opts = istate->filter_opts;
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (opts && !opts->read_staged && ce_stage(ce))
+			continue;
+
+		if (opts && !match_pathspec(opts->pathspec, ce->name, ce_namelen(ce),
+					    opts->max_prefix_len, opts->seen))
+			continue;
+
+		if ((ret = fn(ce, cb_data)))
+			break;
+	}
+	if (opts && !opts->max_prefix) {
+		opts->max_prefix = common_prefix(opts->pathspec);
+		opts->max_prefix_len = opts->max_prefix ? strlen(opts->max_prefix) : 0;
+	}
+
+	return ret;
+}
+
+int get_index_entry_by_name_v2(struct index_state *istate, const char *name, int namelen,
+			       struct cache_entry **ce)
+{
+	int pos = index_name_pos(istate, name, namelen);
+
+	*ce = NULL;
+	if (0 <= pos) {
+		*ce = istate->cache[pos];
+		return 1;
+	}
+	pos = -pos - 1;
+
+	if (pos < istate->cache_nr)
+		*ce = istate->cache[pos];
+	return 0;
+}
+
+static int cmp_cache_name_compare(const void *a_, const void *b_)
+{
+	const struct cache_entry *ce1, *ce2;
+
+	ce1 = *((const struct cache_entry **)a_);
+	ce2 = *((const struct cache_entry **)b_);
+	return cache_name_stage_compare(ce1->name, ce1->ce_namelen, ce_stage(ce1),
+					ce2->name, ce2->ce_namelen, ce_stage(ce2));
+}
+
+void sort_index_v2(struct index_state *istate)
+{
+	/*
+	 * Nuke the cache-tree first, as it will no longer be up to date
+	 */
+	cache_tree_free(&istate->cache_tree);
+	qsort(istate->cache, istate->cache_nr, sizeof(istate->cache[0]),
+	      cmp_cache_name_compare);
+}
+
 struct index_ops v2_ops = {
 	match_stat_basic,
 	verify_hdr,
 	read_index_v2,
-	write_index_v2
+	write_index_v2,
+	index_change_filter_opts_v2
+};
+
+struct internal_ops v2_internal_ops = {
+	for_each_index_entry_v2,
+	get_index_entry_by_name_v2,
+	sort_index_v2
 };
diff --git a/read-cache.c b/read-cache.c
index 3e3a0e2..b30ee75 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -471,6 +471,8 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 	remove_name_hash(istate, ce);
 	istate->cache_changed = 1;
 	istate->cache_nr--;
+	if (pos > 0)
+		istate->cache[pos - 1]->next_ce = ce->next_ce;
 	if (pos >= istate->cache_nr)
 		return 0;
 	memmove(istate->cache + pos,
@@ -490,10 +492,13 @@ void remove_marked_cache_entries(struct index_state *istate)
 	unsigned int i, j;
 
 	for (i = j = 0; i < istate->cache_nr; i++) {
-		if (ce_array[i]->ce_flags & CE_REMOVE)
+		if (ce_array[i]->ce_flags & CE_REMOVE) {
 			remove_name_hash(istate, ce_array[i]);
-		else
+		} else {
+			if (j > 0)
+				ce_array[j - 1]->next_ce = ce_array[i]->next_ce;
 			ce_array[j++] = ce_array[i];
+		}
 	}
 	istate->cache_changed = 1;
 	istate->cache_nr = j;
@@ -996,6 +1001,13 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
 		memmove(istate->cache + pos + 1,
 			istate->cache + pos,
 			(istate->cache_nr - pos - 1) * sizeof(ce));
+
+	if (pos + 1 >= istate->cache_nr)
+		ce->next_ce = NULL;
+	else
+		ce->next_ce = istate->cache[pos]->next_ce;
+	if (pos > 0)
+		istate->cache[pos - 1]->next_ce = ce;
 	set_index_entry(istate, pos, ce);
 	istate->cache_changed = 1;
 	return 0;
@@ -1272,7 +1284,82 @@ static int verify_hdr_version(struct index_state *istate,
 
 int read_index(struct index_state *istate)
 {
-	return read_index_from(istate, get_index_file());
+	return read_index_filtered_from(istate, get_index_file(), NULL);
+}
+
+int read_index_filtered(struct index_state *istate, struct filter_opts *opts)
+{
+	return read_index_filtered_from(istate, get_index_file(), opts);
+}
+
+int set_internal_ops(struct index_state *istate)
+{
+	if (!istate->internal_ops && istate->cache)
+		istate->internal_ops = &v2_internal_ops;
+	if (!istate->internal_ops)
+		return 0;
+	return 1;
+}
+
+/*
+ * Execute fn for each index entry which is currently in istate.  Data
+ * can be given to the function using the cb_data parameter.
+ */
+int for_each_index_entry(struct index_state *istate, each_cache_entry_fn fn, void *cb_data)
+{
+	if (!set_internal_ops(istate))
+		return 0;
+	return istate->internal_ops->for_each_index_entry(istate, fn, cb_data);
+}
+
+/*
+ * Search for an index entry by its name.
+ * The cache entry is returned using the ce parameter.
+ * Returns: 1 if a cache-entry was an exact match
+ *          0 if the name of a cache-entry was partially matched.  The
+ *            first cache-entry that matches is returned using the ce
+ *            parameter.  Finding the cache-entry that is needed is left
+ *            to the caller.
+ */
+int get_index_entry_by_name(struct index_state *istate, const char *name, int namelen,
+			    struct cache_entry **ce)
+{
+	if (!set_internal_ops(istate)) {
+		*ce = NULL;
+		return 0;
+	}
+	return istate->internal_ops->get_index_entry_by_name(istate, name, namelen, ce);
+}
+
+/*
+ * Return the index entry that follows the given one.  Use this when
+ * iteration needs to start from a given cache entry.
+ */
+struct cache_entry *next_index_entry(struct cache_entry *ce)
+{
+	return ce->next_ce;
+}
+
+/*
+ * Sorts the index from an unordered list
+ */
+void sort_index(struct index_state *istate)
+{
+	if (!set_internal_ops(istate))
+		return;
+	istate->internal_ops->sort_index(istate);
+}
+/*
+ * Change the filter_opts, and re-read the index if necessary
+ */
+void index_change_filter_opts(struct index_state *istate, struct filter_opts *opts)
+{
+	if (!istate->ops) {
+		/* Just re-read the index, we haven't read it before */
+		read_index_filtered(istate, opts);
+		return;
+	}
+	istate->ops->index_change_filter_opts(istate, opts);
 }
 
 static int index_changed(struct stat *st_old, struct stat *st_new)
@@ -1295,8 +1382,9 @@ static int index_changed(struct stat *st_old, struct stat *st_new)
 	return 0;
 }
 
-/* remember to discard_cache() before reading a different cache! */
-int read_index_from(struct index_state *istate, const char *path)
+
+int read_index_filtered_from(struct index_state *istate, const char *path,
+			     struct filter_opts *opts)
 {
 	int fd, err, i;
 	struct stat st_old, st_new;
@@ -1337,7 +1425,7 @@ int read_index_from(struct index_state *istate, const char *path)
 		if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
 			err = 1;
 
-		if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
+		if (istate->ops->read_index(istate, mmap, mmap_size, opts) < 0)
 			err = 1;
 		istate->timestamp.sec = st_old.st_mtime;
 		istate->timestamp.nsec = ST_MTIME_NSEC(st_old);
@@ -1345,6 +1433,7 @@ int read_index_from(struct index_state *istate, const char *path)
 			die_errno("cannot stat the open index");
 
 		munmap(mmap, mmap_size);
+		istate->filter_opts = opts;
 		if (!index_changed(&st_old, &st_new) && !err)
 			return istate->cache_nr;
 	}
@@ -1352,6 +1441,13 @@ int read_index_from(struct index_state *istate, const char *path)
 	die("index file corrupt");
 }
 
+
+/* remember to discard_cache() before reading a different cache! */
+int read_index_from(struct index_state *istate, const char *path)
+{
+	return read_index_filtered_from(istate, path, NULL);
+}
+
 int is_index_unborn(struct index_state *istate)
 {
 	return (!istate->cache_nr && !istate->timestamp.sec);
diff --git a/read-cache.h b/read-cache.h
index f31a38b..ce9b79c 100644
--- a/read-cache.h
+++ b/read-cache.h
@@ -24,11 +24,21 @@ struct cache_version_header {
 struct index_ops {
 	int (*match_stat_basic)(const struct cache_entry *ce, struct stat *st, int changed);
 	int (*verify_hdr)(void *mmap, unsigned long size);
-	int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size);
+	int (*read_index)(struct index_state *istate, void *mmap, unsigned long mmap_size,
+			  struct filter_opts *opts);
 	int (*write_index)(struct index_state *istate, int newfd);
+	void (*index_change_filter_opts)(struct index_state *istate, struct filter_opts *opts);
+};
+
+struct internal_ops {
+	int (*for_each_index_entry)(struct index_state *istate, each_cache_entry_fn fn, void *cb_data);
+	int (*get_index_entry_by_name)(struct index_state *istate, const char *name, int namelen,
+				       struct cache_entry **ce);
+	void (*sort_index)(struct index_state *istate);
 };
 
 extern struct index_ops v2_ops;
+extern struct internal_ops v2_internal_ops;
 
 #ifndef NEEDS_ALIGNED_ACCESS
 #define ntoh_s(var) ntohs(var)
-- 
1.8.3.453.g1dfc63d


* [PATCH 06/22] make sure partially read index is not changed
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (4 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-08 16:31   ` Junio C Hamano
  2013-07-07  8:11 ` [PATCH 07/22] dir.c: use index api Thomas Gummerer
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

A partially read index file currently cannot be written to disk.  Make
sure that never happens by re-reading the index file in full, before
the in-memory index is changed, whenever it was only read partially.
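The guard added here amounts to: every mutating operation first checks
a partially_read flag and, if it is set, triggers a full re-read.  A
hypothetical miniature of the pattern (the reload is stubbed out as a
counter; none of these names are git's):

```c
#include <assert.h>

/* Toy index: partially_read mirrors the flag this patch introduces. */
struct mini_index {
	int partially_read;
	int reload_count;	/* stands in for re-reading the file */
	int entries;
};

/* Stub for a full re-read: clears the flag, records that it ran. */
void reload_full(struct mini_index *idx)
{
	idx->reload_count++;
	idx->partially_read = 0;
}

/* A mutator: upgrade to a fully read index before changing anything. */
void add_entry(struct mini_index *idx)
{
	if (idx->partially_read)
		reload_full(idx);
	idx->entries++;
}
```

The real patch does the same upgrade via index_change_filter_opts(istate,
NULL) at the start of replace/remove/add paths, so a partial read can
never leak into a write.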

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/update-index.c | 4 ++++
 cache.h                | 4 +++-
 read-cache-v2.c        | 3 +++
 read-cache.c           | 8 ++++++++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 5c7762e..03f6426 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -49,6 +49,8 @@ static int mark_ce_flags(const char *path, int flag, int mark)
 	int namelen = strlen(path);
 	int pos = cache_name_pos(path, namelen);
 	if (0 <= pos) {
+		if (active_cache_partially_read)
+			cache_change_filter_opts(NULL);
 		if (mark)
 			active_cache[pos]->ce_flags |= flag;
 		else
@@ -253,6 +255,8 @@ static void chmod_path(int flip, const char *path)
 	pos = cache_name_pos(path, strlen(path));
 	if (pos < 0)
 		goto fail;
+	if (active_cache_partially_read)
+		cache_change_filter_opts(NULL);
 	ce = active_cache[pos];
 	mode = ce->ce_mode;
 	if (!S_ISREG(mode))
diff --git a/cache.h b/cache.h
index d38dfbd..f6c3407 100644
--- a/cache.h
+++ b/cache.h
@@ -293,7 +293,8 @@ struct index_state {
 	struct cache_tree *cache_tree;
 	struct cache_time timestamp;
 	unsigned name_hash_initialized : 1,
-		 initialized : 1;
+		 initialized : 1,
+		 partially_read : 1;
 	struct hash_table name_hash;
 	struct hash_table dir_hash;
 	struct index_ops *ops;
@@ -315,6 +316,7 @@ extern void free_name_hash(struct index_state *istate);
 #define active_alloc (the_index.cache_alloc)
 #define active_cache_changed (the_index.cache_changed)
 #define active_cache_tree (the_index.cache_tree)
+#define active_cache_partially_read (the_index.partially_read)
 
 #define read_cache() read_index(&the_index)
 #define read_cache_from(path) read_index_from(&the_index, (path))
diff --git a/read-cache-v2.c b/read-cache-v2.c
index 1ed640d..2cc792d 100644
--- a/read-cache-v2.c
+++ b/read-cache-v2.c
@@ -273,6 +273,7 @@ static int read_index_v2(struct index_state *istate, void *mmap,
 		src_offset += 8;
 		src_offset += extsize;
 	}
+	istate->partially_read = 0;
 	return 0;
 unmap:
 	munmap(mmap, mmap_size);
@@ -495,6 +496,8 @@ static int write_index_v2(struct index_state *istate, int newfd)
 	struct stat st;
 	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 
+	if (istate->partially_read)
+		die("BUG: index: cannot write a partially read index");
 	for (i = removed = extended = 0; i < entries; i++) {
 		if (cache[i]->ce_flags & CE_REMOVE)
 			removed++;
diff --git a/read-cache.c b/read-cache.c
index b30ee75..4529fab 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -30,6 +30,8 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
 {
 	struct cache_entry *old = istate->cache[nr];
 
+	if (istate->partially_read)
+		index_change_filter_opts(istate, NULL);
 	remove_name_hash(istate, old);
 	set_index_entry(istate, nr, ce);
 	istate->cache_changed = 1;
@@ -467,6 +469,8 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 {
 	struct cache_entry *ce = istate->cache[pos];
 
+	if (istate->partially_read)
+		index_change_filter_opts(istate, NULL);
 	record_resolve_undo(istate, ce);
 	remove_name_hash(istate, ce);
 	istate->cache_changed = 1;
@@ -978,6 +982,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
 {
 	int pos;
 
+	if (istate->partially_read)
+		index_change_filter_opts(istate, NULL);
 	if (option & ADD_CACHE_JUST_APPEND)
 		pos = istate->cache_nr;
 	else {
@@ -1184,6 +1190,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
 				/* If we are doing --really-refresh that
 				 * means the index is not valid anymore.
 				 */
+				if (istate->partially_read)
+					index_change_filter_opts(istate, NULL);
 				ce->ce_flags &= ~CE_VALID;
 				istate->cache_changed = 1;
 			}
-- 
1.8.3.453.g1dfc63d


* [PATCH 07/22] dir.c: use index api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (5 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 06/22] make sure partially read index is not changed Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 08/22] tree.c: " Thomas Gummerer
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 dir.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/dir.c b/dir.c
index 897c874..f4919ba 100644
--- a/dir.c
+++ b/dir.c
@@ -468,19 +468,19 @@ void add_exclude(const char *string, const char *base,
 
 static void *read_skip_worktree_file_from_index(const char *path, size_t *size)
 {
-	int pos, len;
+	int len;
 	unsigned long sz;
 	enum object_type type;
 	void *data;
 	struct index_state *istate = &the_index;
+	struct cache_entry *ce;
 
 	len = strlen(path);
-	pos = index_name_pos(istate, path, len);
-	if (pos < 0)
+	if (!get_index_entry_by_name(istate, path, len, &ce))
 		return NULL;
-	if (!ce_skip_worktree(istate->cache[pos]))
+	if (!ce_skip_worktree(ce))
 		return NULL;
-	data = read_sha1_file(istate->cache[pos]->sha1, &type, &sz);
+	data = read_sha1_file(ce->sha1, &type, &sz);
 	if (!data || type != OBJ_BLOB) {
 		free(data);
 		return NULL;
@@ -968,16 +968,13 @@ static enum exist_status directory_exists_in_index_icase(const char *dirname, in
  */
 static enum exist_status directory_exists_in_index(const char *dirname, int len)
 {
-	int pos;
+	struct cache_entry *ce;
 
 	if (ignore_case)
 		return directory_exists_in_index_icase(dirname, len);
 
-	pos = cache_name_pos(dirname, len);
-	if (pos < 0)
-		pos = -pos-1;
-	while (pos < active_nr) {
-		struct cache_entry *ce = active_cache[pos++];
+	get_cache_entry_by_name(dirname, len, &ce);
+	while (ce) {
 		unsigned char endchar;
 
 		if (strncmp(ce->name, dirname, len))
@@ -989,6 +986,7 @@ static enum exist_status directory_exists_in_index(const char *dirname, int len)
 			return index_directory;
 		if (!endchar && S_ISGITLINK(ce->ce_mode))
 			return index_gitdir;
+		ce = next_cache_entry(ce);
 	}
 	return index_nonexistent;
 }
@@ -1114,7 +1112,6 @@ static int exclude_matches_pathspec(const char *path, int len,
 
 static int get_index_dtype(const char *path, int len)
 {
-	int pos;
 	struct cache_entry *ce;
 
 	ce = cache_name_exists(path, len, 0);
@@ -1131,18 +1128,18 @@ static int get_index_dtype(const char *path, int len)
 	}
 
 	/* Try to look it up as a directory */
-	pos = cache_name_pos(path, len);
-	if (pos >= 0)
+	if (get_cache_entry_by_name(path, len, &ce))
 		return DT_UNKNOWN;
-	pos = -pos-1;
-	while (pos < active_nr) {
-		ce = active_cache[pos++];
+
+	while (ce) {
 		if (strncmp(ce->name, path, len))
 			break;
 		if (ce->name[len] > '/')
 			break;
-		if (ce->name[len] < '/')
+		if (ce->name[len] < '/') {
+			ce = next_cache_entry(ce);
 			continue;
+		}
 		if (!ce_uptodate(ce))
 			break;	/* continue? */
 		return DT_DIR;
-- 
1.8.3.453.g1dfc63d


* [PATCH 08/22] tree.c: use index api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (6 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 07/22] dir.c: use index api Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 09/22] name-hash.c: " Thomas Gummerer
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 tree.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/tree.c b/tree.c
index 62fed63..5cd43f4 100644
--- a/tree.c
+++ b/tree.c
@@ -128,20 +128,28 @@ int read_tree_recursive(struct tree *tree,
 	return ret;
 }
 
-static int cmp_cache_name_compare(const void *a_, const void *b_)
+
+struct read_tree_data {
+	read_tree_fn_t fn;
+	int stage;
+};
+
+int get_read_tree_fn(struct cache_entry *ce, void *cb_data)
 {
-	const struct cache_entry *ce1, *ce2;
+	struct read_tree_data *data = cb_data;
 
-	ce1 = *((const struct cache_entry **)a_);
-	ce2 = *((const struct cache_entry **)b_);
-	return cache_name_stage_compare(ce1->name, ce1->ce_namelen, ce_stage(ce1),
-				  ce2->name, ce2->ce_namelen, ce_stage(ce2));
+	if (ce_stage(ce) == data->stage) {
+		data->fn = read_one_entry;
+		return 0;
+	}
+	return 1;
 }
 
 int read_tree(struct tree *tree, int stage, struct pathspec *match)
 {
 	read_tree_fn_t fn = NULL;
-	int i, err;
+	int err;
+	struct read_tree_data rtd;
 
 	/*
 	 * Currently the only existing callers of this function all
@@ -158,11 +166,10 @@ int read_tree(struct tree *tree, int stage, struct pathspec *match)
 	 * do it the original slow way, otherwise, append and then
 	 * sort at the end.
 	 */
-	for (i = 0; !fn && i < active_nr; i++) {
-		struct cache_entry *ce = active_cache[i];
-		if (ce_stage(ce) == stage)
-			fn = read_one_entry;
-	}
+	rtd.fn = fn;
+	rtd.stage = stage;
+	for_each_cache_entry(get_read_tree_fn, &rtd);
+	fn = rtd.fn;
 
 	if (!fn)
 		fn = read_one_entry_quick;
@@ -170,12 +177,7 @@ int read_tree(struct tree *tree, int stage, struct pathspec *match)
 	if (fn == read_one_entry || err)
 		return err;
 
-	/*
-	 * Sort the cache entry -- we need to nuke the cache tree, though.
-	 */
-	cache_tree_free(&active_cache_tree);
-	qsort(active_cache, active_nr, sizeof(active_cache[0]),
-	      cmp_cache_name_compare);
+	sort_cache();
 	return 0;
 }
 
-- 
1.8.3.453.g1dfc63d


* [PATCH 09/22] name-hash.c: use index api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (7 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 08/22] tree.c: " Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 10/22] grep.c: Use " Thomas Gummerer
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 name-hash.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/name-hash.c b/name-hash.c
index 617c86c..6551849 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -144,16 +144,19 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 		add_dir_entry(istate, ce);
 }
 
-static void lazy_init_name_hash(struct index_state *istate)
+static int hash_entry(struct cache_entry *ce, void *istate)
 {
-	int nr;
+	hash_index_entry((struct index_state *)istate, ce);
+	return 0;
+}
 
+static void lazy_init_name_hash(struct index_state *istate)
+{
 	if (istate->name_hash_initialized)
 		return;
 	if (istate->cache_nr)
 		preallocate_hash(&istate->name_hash, istate->cache_nr);
-	for (nr = 0; nr < istate->cache_nr; nr++)
-		hash_index_entry(istate, istate->cache[nr]);
+	for_each_index_entry(istate, hash_entry, istate);
 	istate->name_hash_initialized = 1;
 }
 
-- 
1.8.3.453.g1dfc63d


* [PATCH 10/22] grep.c: Use index api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (8 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 09/22] name-hash.c: " Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 11/22] ls-files.c: use the " Thomas Gummerer
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/grep.c | 71 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 37 insertions(+), 34 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index a419cda..2a1c8f4 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -368,41 +368,33 @@ static void run_pager(struct grep_opt *opt, const char *prefix)
 	free(argv);
 }
 
-static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int cached)
+struct grep_opts {
+	struct grep_opt *opt;
+	const struct pathspec *pathspec;
+	int cached;
+	int hit;
+};
+
+static int grep_cache(struct cache_entry *ce, void *cb_data)
 {
-	int hit = 0;
-	int nr;
-	read_cache();
+	struct grep_opts *opts = cb_data;
 
-	for (nr = 0; nr < active_nr; nr++) {
-		struct cache_entry *ce = active_cache[nr];
-		if (!S_ISREG(ce->ce_mode))
-			continue;
-		if (!match_pathspec_depth(pathspec, ce->name, ce_namelen(ce), 0, NULL))
-			continue;
-		/*
-		 * If CE_VALID is on, we assume worktree file and its cache entry
-		 * are identical, even if worktree file has been modified, so use
-		 * cache version instead
-		 */
-		if (cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce)) {
-			if (ce_stage(ce))
-				continue;
-			hit |= grep_sha1(opt, ce->sha1, ce->name, 0, ce->name);
-		}
-		else
-			hit |= grep_file(opt, ce->name);
-		if (ce_stage(ce)) {
-			do {
-				nr++;
-			} while (nr < active_nr &&
-				 !strcmp(ce->name, active_cache[nr]->name));
-			nr--; /* compensate for loop control */
-		}
-		if (hit && opt->status_only)
-			break;
-	}
-	return hit;
+	if (!S_ISREG(ce->ce_mode))
+		return 0;
+	if (!match_pathspec_depth(opts->pathspec, ce->name, ce_namelen(ce), 0, NULL))
+		return 0;
+	/*
+	 * If CE_VALID is on, we assume worktree file and its cache entry
+	 * are identical, even if worktree file has been modified, so use
+	 * cache version instead
+	 */
+	if (opts->cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce))
+		opts->hit |= grep_sha1(opts->opt, ce->sha1, ce->name, 0, ce->name);
+	else
+		opts->hit |= grep_file(opts->opt, ce->name);
+	if (opts->hit && opts->opt->status_only)
+		return 1;
+	return 0;
 }
 
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
@@ -895,10 +887,21 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	} else if (0 <= opt_exclude) {
 		die(_("--[no-]exclude-standard cannot be used for tracked contents."));
 	} else if (!list.nr) {
+		struct grep_opts opts;
+		struct filter_opts *filter_opts = xmalloc(sizeof(*filter_opts));
+
 		if (!cached)
 			setup_work_tree();
 
-		hit = grep_cache(&opt, &pathspec, cached);
+		memset(filter_opts, 0, sizeof(*filter_opts));
+		filter_opts->pathspec = pathspec.raw;
+		opts.opt = &opt;
+		opts.pathspec = &pathspec;
+		opts.cached = cached;
+		opts.hit = 0;
+		read_cache_filtered(filter_opts);
+		for_each_cache_entry(grep_cache, &opts);
+		hit = opts.hit;
 	} else {
 		if (cached)
 			die(_("both --cached and trees are given."));
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 11/22] ls-files.c: use the index api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (9 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 10/22] grep.c: Use " Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 12/22] read-cache: make read_blob_data_from_index use " Thomas Gummerer
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/ls-files.c | 213 +++++++++++++++++++++++++----------------------------
 1 file changed, 100 insertions(+), 113 deletions(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 08d9786..82857d4 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -88,36 +88,35 @@ static void show_killed_files(struct dir_struct *dir)
 	for (i = 0; i < dir->nr; i++) {
 		struct dir_entry *ent = dir->entries[i];
 		char *cp, *sp;
-		int pos, len, killed = 0;
+		int len, killed = 0;
 
 		for (cp = ent->name; cp - ent->name < ent->len; cp = sp + 1) {
+			struct cache_entry *ce;
+
 			sp = strchr(cp, '/');
 			if (!sp) {
 				/* If ent->name is prefix of an entry in the
 				 * cache, it will be killed.
 				 */
-				pos = cache_name_pos(ent->name, ent->len);
-				if (0 <= pos)
+				if (get_cache_entry_by_name(ent->name, ent->len, &ce))
 					die("bug in show-killed-files");
-				pos = -pos - 1;
-				while (pos < active_nr &&
-				       ce_stage(active_cache[pos]))
-					pos++; /* skip unmerged */
-				if (active_nr <= pos)
+				while (ce && ce_stage(ce))
+					ce = next_cache_entry(ce);
+				if (!ce)
 					break;
 				/* pos points at a name immediately after
 				 * ent->name in the cache.  Does it expect
 				 * ent->name to be a directory?
 				 */
-				len = ce_namelen(active_cache[pos]);
+				len = ce_namelen(ce);
 				if ((ent->len < len) &&
-				    !strncmp(active_cache[pos]->name,
+				    !strncmp(ce->name,
 					     ent->name, ent->len) &&
-				    active_cache[pos]->name[ent->len] == '/')
+				    ce->name[ent->len] == '/')
 					killed = 1;
 				break;
 			}
-			if (0 <= cache_name_pos(ent->name, sp - ent->name)) {
+			if (get_cache_entry_by_name(ent->name, sp - ent->name, &ce)) {
 				/* If any of the leading directories in
 				 * ent->name is registered in the cache,
 				 * ent->name will be killed.
@@ -213,10 +212,43 @@ static int ce_excluded(struct dir_struct *dir, struct cache_entry *ce)
 	return is_excluded(dir, ce->name, &dtype);
 }
 
-static void show_files(struct dir_struct *dir)
+static int show_cached_stage(struct cache_entry *ce, void *cb_data)
 {
-	int i;
+	struct dir_struct *dir = cb_data;
+
+	if ((dir->flags & DIR_SHOW_IGNORED) && !ce_excluded(dir, ce))
+		return 0;
+	if (show_unmerged && !ce_stage(ce))
+		return 0;
+	if (ce->ce_flags & CE_UPDATE)
+		return 0;
+	show_ce_entry(ce_stage(ce) ? tag_unmerged :
+		(ce_skip_worktree(ce) ? tag_skip_worktree : tag_cached), ce);
+	return 0;
+}
 
+static int show_deleted_modified(struct cache_entry *ce, void *cb_data)
+{
+	struct stat st;
+	int err;
+	struct dir_struct *dir = cb_data;
+
+	if ((dir->flags & DIR_SHOW_IGNORED) && !ce_excluded(dir, ce))
+		return 0;
+	if (ce->ce_flags & CE_UPDATE)
+		return 0;
+	if (ce_skip_worktree(ce))
+		return 0;
+	err = lstat(ce->name, &st);
+	if (show_deleted && err)
+		show_ce_entry(tag_removed, ce);
+	if (show_modified && ce_modified(ce, &st, 0))
+		show_ce_entry(tag_modified, ce);
+	return 0;
+}
+
+static void show_files(struct dir_struct *dir)
+{
 	/* For cached/deleted files we don't need to even do the readdir */
 	if (show_others || show_killed) {
 		fill_directory(dir, pathspec);
@@ -225,66 +257,18 @@ static void show_files(struct dir_struct *dir)
 		if (show_killed)
 			show_killed_files(dir);
 	}
-	if (show_cached || show_stage) {
-		for (i = 0; i < active_nr; i++) {
-			struct cache_entry *ce = active_cache[i];
-			if ((dir->flags & DIR_SHOW_IGNORED) &&
-			    !ce_excluded(dir, ce))
-				continue;
-			if (show_unmerged && !ce_stage(ce))
-				continue;
-			if (ce->ce_flags & CE_UPDATE)
-				continue;
-			show_ce_entry(ce_stage(ce) ? tag_unmerged :
-				(ce_skip_worktree(ce) ? tag_skip_worktree : tag_cached), ce);
-		}
-	}
-	if (show_deleted || show_modified) {
-		for (i = 0; i < active_nr; i++) {
-			struct cache_entry *ce = active_cache[i];
-			struct stat st;
-			int err;
-			if ((dir->flags & DIR_SHOW_IGNORED) &&
-			    !ce_excluded(dir, ce))
-				continue;
-			if (ce->ce_flags & CE_UPDATE)
-				continue;
-			if (ce_skip_worktree(ce))
-				continue;
-			err = lstat(ce->name, &st);
-			if (show_deleted && err)
-				show_ce_entry(tag_removed, ce);
-			if (show_modified && ce_modified(ce, &st, 0))
-				show_ce_entry(tag_modified, ce);
-		}
-	}
+	if (show_cached | show_stage)
+		for_each_cache_entry(show_cached_stage, dir);
+	if (show_deleted | show_modified)
+		for_each_cache_entry(show_deleted_modified, dir);
 }
 
-/*
- * Prune the index to only contain stuff starting with "prefix"
- */
-static void prune_cache(const char *prefix)
+static int hoist_unmerged(struct cache_entry *ce, void *cb_data)
 {
-	int pos = cache_name_pos(prefix, max_prefix_len);
-	unsigned int first, last;
-
-	if (pos < 0)
-		pos = -pos-1;
-	memmove(active_cache, active_cache + pos,
-		(active_nr - pos) * sizeof(struct cache_entry *));
-	active_nr -= pos;
-	first = 0;
-	last = active_nr;
-	while (last > first) {
-		int next = (last + first) >> 1;
-		struct cache_entry *ce = active_cache[next];
-		if (!strncmp(ce->name, prefix, max_prefix_len)) {
-			first = next+1;
-			continue;
-		}
-		last = next;
-	}
-	active_nr = last;
+	if (!ce_stage(ce))
+		return 0;
+	ce->ce_flags |= CE_STAGEMASK;
+	return 0;
 }
 
 static void strip_trailing_slash_from_submodules(void)
@@ -292,16 +276,38 @@ static void strip_trailing_slash_from_submodules(void)
 	const char **p;
 
 	for (p = pathspec; *p != NULL; p++) {
-		int len = strlen(*p), pos;
+		int len = strlen(*p);
+		struct cache_entry *ce;
 
 		if (len < 1 || (*p)[len - 1] != '/')
 			continue;
-		pos = cache_name_pos(*p, len - 1);
-		if (pos >= 0 && S_ISGITLINK(active_cache[pos]->ce_mode))
+		if (get_cache_entry_by_name(*p, len - 1, &ce) && S_ISGITLINK(ce->ce_mode))
 			*p = xstrndup(*p, len - 1);
 	}
 }
 
+int mark_entry_to_show(struct cache_entry *ce, void *cb_data)
+{
+	struct cache_entry *last_stage0 = cb_data;
+	switch (ce_stage(ce)) {
+	case 0:
+		last_stage0 = ce;
+		/* fallthru */
+	default:
+		return 0;
+	case 1:
+		/*
+		 * If there is stage #0 entry for this, we do not
+		 * need to show it.  We use CE_UPDATE bit to mark
+		 * such an entry.
+		 */
+		if (last_stage0 &&
+			!strcmp(last_stage0->name, ce->name))
+			ce->ce_flags |= CE_UPDATE;
+	}
+	return 0;
+}
+
 /*
  * Read the tree specified with --with-tree option
  * (typically, HEAD) into stage #1 and then
@@ -316,7 +322,6 @@ void overlay_tree_on_cache(const char *tree_name, const char *prefix)
 	unsigned char sha1[20];
 	struct pathspec pathspec;
 	struct cache_entry *last_stage0 = NULL;
-	int i;
 
 	if (get_sha1(tree_name, sha1))
 		die("tree-ish %s not found.", tree_name);
@@ -325,12 +330,7 @@ void overlay_tree_on_cache(const char *tree_name, const char *prefix)
 		die("bad tree-ish %s", tree_name);
 
 	/* Hoist the unmerged entries up to stage #3 to make room */
-	for (i = 0; i < active_nr; i++) {
-		struct cache_entry *ce = active_cache[i];
-		if (!ce_stage(ce))
-			continue;
-		ce->ce_flags |= CE_STAGEMASK;
-	}
+	for_each_cache_entry(hoist_unmerged, NULL);
 
 	if (prefix) {
 		static const char *(matchbuf[2]);
@@ -343,25 +343,7 @@ void overlay_tree_on_cache(const char *tree_name, const char *prefix)
 	if (read_tree(tree, 1, &pathspec))
 		die("unable to read tree entries %s", tree_name);
 
-	for (i = 0; i < active_nr; i++) {
-		struct cache_entry *ce = active_cache[i];
-		switch (ce_stage(ce)) {
-		case 0:
-			last_stage0 = ce;
-			/* fallthru */
-		default:
-			continue;
-		case 1:
-			/*
-			 * If there is stage #0 entry for this, we do not
-			 * need to show it.  We use CE_UPDATE bit to mark
-			 * such an entry.
-			 */
-			if (last_stage0 &&
-			    !strcmp(last_stage0->name, ce->name))
-				ce->ce_flags |= CE_UPDATE;
-		}
-	}
+	for_each_cache_entry(mark_entry_to_show, last_stage0);
 }
 
 int report_path_error(const char *ps_matched, const char **pathspec, const char *prefix)
@@ -457,6 +439,7 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 	struct dir_struct dir;
 	struct exclude_list *el;
 	struct string_list exclude_list = STRING_LIST_INIT_NODUP;
+	struct filter_opts *opts = xmalloc(sizeof(*opts));
 	struct option builtin_ls_files_options[] = {
 		{ OPTION_CALLBACK, 'z', NULL, NULL, NULL,
 			N_("paths are separated with NUL character"),
@@ -522,9 +505,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		prefix_len = strlen(prefix);
 	git_config(git_default_config, NULL);
 
-	if (read_cache() < 0)
-		die("index file corrupt");
-
 	argc = parse_options(argc, argv, prefix, builtin_ls_files_options,
 			ls_files_usage, 0);
 	el = add_exclude_list(&dir, EXC_CMDL, "--exclude option");
@@ -557,14 +537,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 
 	pathspec = get_pathspec(prefix, argv);
 
-	/* be nice with submodule paths ending in a slash */
-	if (pathspec)
-		strip_trailing_slash_from_submodules();
-
-	/* Find common prefix for all pathspec's */
-	max_prefix = common_prefix(pathspec);
-	max_prefix_len = max_prefix ? strlen(max_prefix) : 0;
-
 	/* Treat unmatching pathspec elements as errors */
 	if (pathspec && error_unmatch) {
 		int num;
@@ -573,6 +545,23 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		ps_matched = xcalloc(1, num);
 	}
 
+	memset(opts, 0, sizeof(*opts));
+	opts->pathspec = pathspec;
+	opts->read_staged = 1;
+	if (show_resolve_undo)
+		opts->read_resolve_undo = 1;
+	read_cache_filtered(opts);
+
+	if (pathspec) {
+		strip_trailing_slash_from_submodules();
+		opts->pathspec = pathspec;
+		cache_change_filter_opts(opts);
+	}
+
+	/* Find common prefix for all pathspec's */
+	max_prefix = opts->max_prefix;
+	max_prefix_len = opts->max_prefix_len;
+
 	if ((dir.flags & DIR_SHOW_IGNORED) && !exc_given)
 		die("ls-files --ignored needs some exclude pattern");
 
@@ -581,8 +570,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 	      show_killed || show_modified || show_resolve_undo))
 		show_cached = 1;
 
-	if (max_prefix)
-		prune_cache(max_prefix);
 	if (with_tree) {
 		/*
 		 * Basic sanity check; show-stages and show-unmerged
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 12/22] read-cache: make read_blob_data_from_index use index api
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (10 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 11/22] ls-files.c: use the " Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 13/22] documentation: add documentation of the index-v5 file format Thomas Gummerer
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 4529fab..c81e643 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1588,29 +1588,27 @@ int index_name_is_other(const struct index_state *istate, const char *name,
 
 void *read_blob_data_from_index(struct index_state *istate, const char *path, unsigned long *size)
 {
-	int pos, len;
+	int ret, len;
 	unsigned long sz;
 	enum object_type type;
 	void *data;
+	struct cache_entry *ce;
 
 	len = strlen(path);
-	pos = index_name_pos(istate, path, len);
-	if (pos < 0) {
+	ret = get_index_entry_by_name(istate, path, len, &ce);
+	if (!ret) {
 		/*
 		 * We might be in the middle of a merge, in which
 		 * case we would read stage #2 (ours).
 		 */
-		int i;
-		for (i = -pos - 1;
-		     (pos < 0 && i < istate->cache_nr &&
-		      !strcmp(istate->cache[i]->name, path));
-		     i++)
-			if (ce_stage(istate->cache[i]) == 2)
-				pos = i;
+		for (; !ret && ce && !strcmp(ce->name, path); ce = next_index_entry(ce))
+			if (ce_stage(ce) == 2)
+				ret = 1;
+
 	}
-	if (pos < 0)
+	if (!ret)
 		return NULL;
-	data = read_sha1_file(istate->cache[pos]->sha1, &type, &sz);
+	data = read_sha1_file(ce->sha1, &type, &sz);
 	if (!data || type != OBJ_BLOB) {
 		free(data);
 		return NULL;
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 13/22] documentation: add documentation of the index-v5 file format
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (11 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 12/22] read-cache: make read_blob_data_from_index use " Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-11 10:39   ` Duy Nguyen
  2013-07-07  8:11 ` [PATCH 14/22] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Add documentation of the index file format version 5 to
Documentation/technical.

Helped-by: Michael Haggerty <mhagger@alum.mit.edu>
Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Robin Rosenberg <robin.rosenberg@dewire.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Documentation/technical/index-file-format-v5.txt | 296 +++++++++++++++++++++++
 1 file changed, 296 insertions(+)
 create mode 100644 Documentation/technical/index-file-format-v5.txt

diff --git a/Documentation/technical/index-file-format-v5.txt b/Documentation/technical/index-file-format-v5.txt
new file mode 100644
index 0000000..4213087
--- /dev/null
+++ b/Documentation/technical/index-file-format-v5.txt
@@ -0,0 +1,296 @@
+GIT index format
+================
+
+== The git index
+
+   The git index file (.git/index) documents the status of the files
+     in the git staging area.
+
+   The staging area is used for preparing commits, merging, etc.
+
+== The git index file format
+
+   All binary numbers are in network byte order. Version 5 is described
+     here. The index file consists of various sections. They appear in
+     the following order in the file.
+
+   - header: the description of the index format, including its signature,
+     version and various other fields that are used internally.
+
+   - diroffsets (ndir entries of "directory offset"): A 4-byte offset
+       relative to the beginning of the "direntries block" (see below)
+       for each of the ndir directories in the index, sorted by pathname
+       (of the directory it's pointing to). [1]
+
+   - direntries (ndir entries of "directory entry"): A directory entry
+       for each of the ndir directories in the index, sorted by pathname
+       (see below). [2]
+
+   - fileoffsets (nfile entries of "file offset"): A 4-byte offset
+       relative to the beginning of the fileentries block (see below)
+       for each of the nfile files in the index. [1]
+
+   - fileentries (nfile entries of "file entry"): A file entry for
+       each of the nfile files in the index (see below).
+
+   - crdata: A number of entries for conflicted data/resolved conflicts
+       (see below).
+
+   - Extensions (currently none; see below for the format of future
+       extensions)
+
+     Extensions are identified by signature. Optional extensions can
+     be ignored if GIT does not understand them.
+
+     GIT supports an arbitrary number of extensions, but currently none
+     is implemented. [3]
+
+     extsig (32-bits): extension signature. If the first byte is 'A'..'Z'
+     the extension is optional and can be ignored.
+
+     extsize (32-bits): size of the extension, excluding the header
+       (extsig, extsize, extchecksum).
+
+     extchecksum (32-bits): crc32 checksum of the extension signature
+       and size.
+
+    - Extension data.
+
+== Header
+   sig (32-bits): Signature:
+     The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
+
+   vnr (32-bits): Version number:
+     The current supported versions are 2, 3, 4 and 5.
+
+   ndir (32-bits): number of directories in the index.
+
+   nfile (32-bits): number of file entries in the index.
+
+   fblockoffset (32-bits): offset to the file block, relative to the
+     beginning of the file.
+
+   - Offset to the extensions.
+
+     nextensions (32-bits): number of extensions.
+
+     extoffset (32-bits): offset to the extension (possibly none; there
+       are as many extoffset fields as indicated by nextensions).
+
+   headercrc (32-bits): crc checksum including the header and the
+     offsets to the extensions.
+
+
+== Directory offsets (diroffsets)
+
+  diroffset (32-bits): offset to the directory relative to the beginning
+    of the index file. There are ndir + 1 offsets in the diroffset table,
+    the last points to the end of the last direntry. With this last
+    entry, we can replace the strlen call when reading the directory
+    name by calculating it as diroffset[n+1]-diroffset[n]-61, where 61
+    is the combined size of the fixed directory data, the crc sum and
+    the NUL byte that follow each directory name.
+
+  This part is needed for making the directory entries bisectable and
+    thus allowing a binary search.
+
+== Directory entry (direntries)
+
+  Directory entries are sorted in lexicographic order by the name
+    of their path starting with the root.
+
+  pathname (variable length, nul terminated): relative to top level
+    directory (without the leading slash). '/' is used as path
+    separator. A string of length 0 ('') indicates the root directory.
+    The special path components "." and ".." (without quotes) are
+    disallowed. The path also includes a trailing slash. [9]
+
+  foffset (32-bits): offset to the lexicographically first file in
+    the file offsets (fileoffsets), relative to the beginning of
+    the fileoffset block.
+
+  cr (32-bits): offset to conflicted/resolved data at the end of the
+    index. 0 if there is no such data. [4]
+
+  ncr (32-bits): number of conflicted/resolved data entries at the
+    end of the index if the offset is non-zero. If cr is 0, ncr is
+    also 0.
+
+  nsubtrees (32-bits): number of subtrees this tree has in the index.
+
+  nfiles (32-bits): number of files in the directory that are in
+    the index.
+
+  nentries (32-bits): number of entries in the index that are covered
+    by the tree this entry represents. (-1 if the entry is invalid).
+    This number includes all the files in this tree, recursively.
+
+  objname (160-bits): object name for the object that would result
+    from writing this span of index as a tree. This is only valid
+    if nentries is valid, meaning the cache-tree is valid.
+
+  flags (16-bits): 'flags' field split into (high to low bits) (For
+    D/F conflicts)
+
+    stage (2-bits): stage of the directory during merge
+
+    14-bit unused
+
+  dircrc (32-bits): crc32 checksum for each directory entry.
+
+  The last 24 bytes (4-byte number of entries + 160-bit object name) are
+    for the cache tree. An entry can be in an invalidated state which is
+    represented by having -1 in the entry_count field.
+
+  The entries are written out in the top-down, depth-first order. The
+    first entry represents the root level of the repository, followed by
+    the first subtree - let's call it A - of the root level, followed by
+    the first subtree of A, ... There is no prefix compression for
+    directories.
+
+== File offsets (fileoffsets)
+
+  fileoffset (32-bits): offset to the file relative to the beginning of
+    the fileentries block.
+
+  This part is needed for making the file entries bisectable and
+    thus allowing a binary search. There are nfile + 1 offsets in the
+    fileoffset table; the last points to the end of the last
+    fileentry. With this last entry, we can replace the strlen call
+    when reading each filename by calculating its length from the
+    offsets.
+
+== File entry (fileentries)
+
+  File entries are sorted in ascending order on the name field, after the
+  respective offset given by the directory entries. All file names are
+  prefix compressed, meaning the file name is relative to the directory.
+
+  filename (variable length, nul terminated). The exact encoding is
+    undefined, but the filename cannot contain a NUL byte (iow, the same
+    encoding as a UNIX pathname).
+
+  flags (16-bits): 'flags' field split into (high to low bits)
+
+    assumevalid (1-bit): assume-valid flag
+
+    intenttoadd (1-bit): intent-to-add flag, used by "git add -N".
+      Extended flag in index v3.
+
+    stage (2-bit): stage of the file during merge
+
+    skipworktree (1-bit): skip-worktree flag, used by sparse checkout.
+      Extended flag in index v3.
+
+    smudged (1-bit): indicates if the file is racily smudged.
+
+    10-bit unused, must be zero [6]
+
+  mode (16-bits): file mode, split into (high to low bits)
+
+    objtype (4-bits): object type
+      valid values in binary are 1000 (regular file), 1010 (symbolic
+      link) and 1110 (gitlink)
+
+    3-bit unused
+
+    permission (9-bits): unix permission. Only 0755 and 0644 are valid
+      for regular files. Symbolic links and gitlinks have value 0 in
+      this field.
+
+  mtimes (32-bits): mtime seconds, the last time a file's data changed
+    this is stat(2) data
+
+  mtimens (32-bits): mtime nanosecond fractions
+    this is stat(2) data
+
+  file size (32-bits): The on-disk size, truncated to 32 bits.
+    this is stat(2) data
+
+  statcrc (32-bits): crc32 checksum over ctime seconds, ctime
+    nanoseconds, ino, dev, uid, gid (All stat(2) data
+    except mtime and file size). If the statcrc is 0 it will
+    be ignored. [7]
+
+  objhash (160-bits): SHA-1 for the represented object
+
+  entrycrc (32-bits): crc32 checksum for the file entry. The crc code
+    includes the offset to the file, relative to the
+    beginning of the file.
+
+== Conflict data
+
+  A conflict is represented in the index as a set of higher stage entries.
+  These entries are stored at the end of the index. When a conflict is
+  resolved (e.g. with "git add path"), a bit is flipped to indicate that
+  the conflict is resolved, but the entries are kept, so that
+  conflicts can be recreated (e.g. with "git checkout -m") in case users
+  want to redo a conflict resolution from scratch.
+
+  The first part of a conflict (usually stage 1) will be stored both in
+  the entries part of the index and in the conflict part. All other parts
+  will only be stored in the conflict part.
+
+  filename (variable length, nul terminated): filename of the entry,
+    relative to its containing directory.
+
+  nfileconflicts (32-bits): number of conflicts for the file [8]
+
+  flags (nfileconflicts entries of "flags") (16-bits): 'flags' field
+    split into:
+
+    conflicted (1-bit): conflicted state (conflicted/resolved) (1 if
+      conflicted)
+
+    stage (2-bits): stage during merge.
+
+    13-bit unused
+
+  entry_mode (nfileconflicts entries of "entry mode") (16-bits):
+    octal numbers, the entry mode of each entry in the different stages
+    (how many there are is given by nfileconflicts above).
+
+  objectnames (nfileconflicts entries of "object name") (160-bits):
+    object names of the different stages.
+
+  conflictcrc (32-bits): crc32 checksum over conflict data.
+
+== Design explanations
+
+[1] The directory and file offsets are included in the index format
+    to enable bisectability of the index, for binary searches. Updating
+    a single entry and partial reading will benefit from this.
+
+[2] The directories are saved in their own block, to be able to
+    quickly search for a directory in the index. They include an
+    offset to the (lexically) first file in the directory.
+
+[3] The data of the cache-tree extension and the resolve undo
+    extension is now part of the index itself, but if other extensions
+    come up in the future, there is no need to change the index, they
+    can simply be added at the end.
+
+[4] To avoid rewrites of the whole index when there are conflicts or
+    conflicts are being resolved, conflicted data will be stored at
+    the end of the index. To mark the conflict resolved, just a bit
+    has to be flipped. The data will still be there, if a user wants
+    to redo the conflict resolution.
+
+[5] Since only 4 modes are effectively allowed in git but 32 bits are
+    used to store them, having a two-bit flag for the mode is enough
+    and saves 4 bytes per entry.
+
+[6] The length of the file name was dropped, since each file name is
+    nul terminated anyway.
+
+[7] Since all stat data (except mtime and ctime) is just used for
+    checking if a file has changed, a checksum of the data is enough.
+    In addition, Thomas Rast suggested that ctime could be ditched
+    completely (core.trustctime=false) and thus be included in the
+    checksum. This would save 24 bytes per index entry, which would
+    be about 4 MB on the Webkit index.
+    (Thanks for the suggestion to Michael Haggerty)
+
+[8] Since a conflict can consist of more than one entry, it is
+    necessary to know how many conflict data entries there are.
+
+[9] As Michael Haggerty pointed out on the mailing list, storing the
+    trailing slash will simplify a few operations.
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 14/22] read-cache: make in-memory format aware of stat_crc
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (12 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 13/22] documentation: add documentation of the index-v5 file format Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 15/22] read-cache: read index-v5 Thomas Gummerer
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Make the in-memory format aware of the stat_crc used by index-v5.
It is simply ignored by index versions prior to v5.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h      |  1 +
 read-cache.c | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/cache.h b/cache.h
index f6c3407..d77af5e 100644
--- a/cache.h
+++ b/cache.h
@@ -127,6 +127,7 @@ struct cache_entry {
 	unsigned int ce_flags;
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
+	uint32_t ce_stat_crc;
 	struct cache_entry *next; /* used by name_hash */
 	struct cache_entry *next_ce; /* used to keep a list of cache entries */
 	char name[FLEX_ARRAY]; /* more */
diff --git a/read-cache.c b/read-cache.c
index c81e643..5ec0222 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -108,6 +108,29 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
 	return changed;
 }
 
+static uint32_t calculate_stat_crc(struct cache_entry *ce)
+{
+	unsigned int ctimens = 0;
+	uint32_t stat, stat_crc;
+
+	stat = htonl(ce->ce_stat_data.sd_ctime.sec);
+	stat_crc = crc32(0, (Bytef*)&stat, 4);
+#ifdef USE_NSEC
+	ctimens = ce->ce_stat_data.sd_ctime.nsec;
+#endif
+	stat = htonl(ctimens);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_ino);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_dev);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_uid);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	stat = htonl(ce->ce_stat_data.sd_gid);
+	stat_crc = crc32(stat_crc, (Bytef*)&stat, 4);
+	return stat_crc;
+}
+
 /*
  * This only updates the "non-critical" parts of the directory
  * cache, ie the parts that aren't tracked by GIT, and only used
@@ -122,6 +145,8 @@ void fill_stat_cache_info(struct cache_entry *ce, struct stat *st)
 
 	if (S_ISREG(st->st_mode))
 		ce_mark_uptodate(ce);
+
+	ce->ce_stat_crc = calculate_stat_crc(ce);
 }
 
 static int ce_compare_data(const struct cache_entry *ce, struct stat *st)
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 15/22] read-cache: read index-v5
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (13 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 14/22] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07 20:18   ` Eric Sunshine
  2013-07-07  8:11 ` [PATCH 16/22] read-cache: read resolve-undo data Thomas Gummerer
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Make git read the index file version 5 without complaining.

This version of the reader reads neither the cache-tree
nor the resolve-undo data, but doesn't choke on an index that
includes such data.

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 Makefile        |   1 +
 cache.h         |  75 ++++++-
 read-cache-v5.c | 658 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 read-cache.h    |   1 +
 4 files changed, 734 insertions(+), 1 deletion(-)
 create mode 100644 read-cache-v5.c

diff --git a/Makefile b/Makefile
index 73369ae..80e35f5 100644
--- a/Makefile
+++ b/Makefile
@@ -856,6 +856,7 @@ LIB_OBJS += quote.o
 LIB_OBJS += reachable.o
 LIB_OBJS += read-cache.o
 LIB_OBJS += read-cache-v2.o
+LIB_OBJS += read-cache-v5.o
 LIB_OBJS += reflog-walk.o
 LIB_OBJS += refs.o
 LIB_OBJS += remote.o
diff --git a/cache.h b/cache.h
index d77af5e..e110ec8 100644
--- a/cache.h
+++ b/cache.h
@@ -99,7 +99,7 @@ unsigned long git_deflate_bound(git_zstream *, unsigned long);
 #define CACHE_SIGNATURE 0x44495243	/* "DIRC" */
 
 #define INDEX_FORMAT_LB 2
-#define INDEX_FORMAT_UB 4
+#define INDEX_FORMAT_UB 5
 
 /*
  * The "cache_time" is just the low 32 bits of the
@@ -121,6 +121,15 @@ struct stat_data {
 	unsigned int sd_size;
 };
 
+/*
+ * The *next pointer is used in read_entries_v5 for holding
+ * all the elements of a directory, and points to the next
+ * cache_entry in a directory.
+ *
+ * It is reset by the add_name_hash call in set_index_entry,
+ * which makes it point to the next cache_entry in the
+ * correct in-memory ordering.
+ */
 struct cache_entry {
 	struct stat_data ce_stat_data;
 	unsigned int ce_mode;
@@ -133,11 +142,59 @@ struct cache_entry {
 	char name[FLEX_ARRAY]; /* more */
 };
 
+struct directory_entry {
+	struct directory_entry *next;
+	struct directory_entry *next_hash;
+	struct cache_entry *ce;
+	struct cache_entry *ce_last;
+	struct conflict_entry *conflict;
+	struct conflict_entry *conflict_last;
+	unsigned int conflict_size;
+	unsigned int de_foffset;
+	unsigned int de_cr;
+	unsigned int de_ncr;
+	unsigned int de_nsubtrees;
+	unsigned int de_nfiles;
+	unsigned int de_nentries;
+	unsigned char sha1[20];
+	unsigned short de_flags;
+	unsigned int de_pathlen;
+	char pathname[FLEX_ARRAY];
+};
+
+struct conflict_part {
+	struct conflict_part *next;
+	unsigned short flags;
+	unsigned short entry_mode;
+	unsigned char sha1[20];
+};
+
+struct conflict_entry {
+	struct conflict_entry *next;
+	unsigned int nfileconflicts;
+	struct conflict_part *entries;
+	unsigned int namelen;
+	unsigned int pathlen;
+	char name[FLEX_ARRAY];
+};
+
+struct ondisk_conflict_part {
+	unsigned short flags;
+	unsigned short entry_mode;
+	unsigned char sha1[20];
+};
+
+#define CE_NAMEMASK  (0x0fff)
 #define CE_STAGEMASK (0x3000)
 #define CE_EXTENDED  (0x4000)
 #define CE_VALID     (0x8000)
+#define CE_SMUDGED   (0x0400) /* index v5 only flag */
 #define CE_STAGESHIFT 12
 
+#define CONFLICT_CONFLICTED (0x8000)
+#define CONFLICT_STAGESHIFT 13
+#define CONFLICT_STAGEMASK (0x6000)
+
 /*
  * Range 0xFFFF0000 in ce_flags is divided into
  * two parts: in-memory flags and on-disk ones.
@@ -174,6 +231,18 @@ struct cache_entry {
 #define CE_EXTENDED_FLAGS (CE_INTENT_TO_ADD | CE_SKIP_WORKTREE)
 
 /*
+ * Representation of the extended on-disk flags in the v5 format.
+ * They must not collide with the ordinary on-disk flags, and need to
+ * fit in 16 bits.  Note however that v5 does not save the name
+ * length.
+ */
+#define CE_INTENT_TO_ADD_V5  (0x4000)
+#define CE_SKIP_WORKTREE_V5  (0x0800)
+#if (CE_VALID|CE_STAGEMASK) & (CE_INTENT_TO_ADD_V5|CE_SKIP_WORKTREE_V5)
+#error "v5 on-disk flags collide with ordinary on-disk flags"
+#endif
+
+/*
  * Safeguard to avoid saving wrong flags:
  *  - CE_EXTENDED2 won't get saved until its semantic is known
  *  - Bits in 0x0000FFFF have been saved in ce_flags already
@@ -212,6 +281,8 @@ static inline unsigned create_ce_flags(unsigned stage)
 #define ce_skip_worktree(ce) ((ce)->ce_flags & CE_SKIP_WORKTREE)
 #define ce_mark_uptodate(ce) ((ce)->ce_flags |= CE_UPTODATE)
 
+#define conflict_stage(c) ((CONFLICT_STAGEMASK & (c)->flags) >> CONFLICT_STAGESHIFT)
+
 #define ce_permissions(mode) (((mode) & 0100) ? 0755 : 0644)
 static inline unsigned int create_ce_mode(unsigned int mode)
 {
@@ -259,6 +330,8 @@ static inline unsigned int canon_mode(unsigned int mode)
 }
 
 #define cache_entry_size(len) (offsetof(struct cache_entry,name) + (len) + 1)
+#define directory_entry_size(len) (offsetof(struct directory_entry,pathname) + (len) + 1)
+#define conflict_entry_size(len) (offsetof(struct conflict_entry,name) + (len) + 1)
 
 /*
  * Options by which the index should be filtered when read partially.
diff --git a/read-cache-v5.c b/read-cache-v5.c
new file mode 100644
index 0000000..e319f30
--- /dev/null
+++ b/read-cache-v5.c
@@ -0,0 +1,658 @@
+#include "cache.h"
+#include "read-cache.h"
+#include "resolve-undo.h"
+#include "cache-tree.h"
+#include "dir.h"
+
+#define ptr_add(x,y) ((void *)(((char *)(x)) + (y)))
+
+struct cache_header {
+	unsigned int hdr_ndir;
+	unsigned int hdr_nfile;
+	unsigned int hdr_fblockoffset;
+	unsigned int hdr_nextension;
+};
+
+/*****************************************************************
+ * Index File I/O
+ *****************************************************************/
+
+struct ondisk_cache_entry {
+	unsigned short flags;
+	unsigned short mode;
+	struct cache_time mtime;
+	unsigned int size;
+	int stat_crc;
+	unsigned char sha1[20];
+};
+
+struct ondisk_directory_entry {
+	unsigned int foffset;
+	unsigned int cr;
+	unsigned int ncr;
+	unsigned int nsubtrees;
+	unsigned int nfiles;
+	unsigned int nentries;
+	unsigned char sha1[20];
+	unsigned short flags;
+};
+
+static int check_crc32(int initialcrc,
+			void *data,
+			size_t len,
+			unsigned int expected_crc)
+{
+	int crc;
+
+	crc = crc32(initialcrc, (Bytef*)data, len);
+	return crc == expected_crc;
+}
+
+static int match_stat_crc(struct stat *st, uint32_t expected_crc)
+{
+	uint32_t data, stat_crc = 0;
+	unsigned int ctimens = 0;
+
+	data = htonl(st->st_ctime);
+	stat_crc = crc32(0, (Bytef*)&data, 4);
+#ifdef USE_NSEC
+	ctimens = ST_CTIME_NSEC(*st);
+#endif
+	data = htonl(ctimens);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_ino);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_dev);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_uid);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+	data = htonl(st->st_gid);
+	stat_crc = crc32(stat_crc, (Bytef*)&data, 4);
+
+	return stat_crc == expected_crc;
+}
+
+static int match_stat_basic(const struct cache_entry *ce,
+			    struct stat *st,
+			    int changed)
+{
+
+	if (ce->ce_stat_data.sd_mtime.sec != (unsigned int)st->st_mtime)
+		changed |= MTIME_CHANGED;
+#ifdef USE_NSEC
+	if (ce->ce_stat_data.sd_mtime.nsec != ST_MTIME_NSEC(*st))
+		changed |= MTIME_CHANGED;
+#endif
+	if (ce->ce_stat_data.sd_size != (unsigned int)st->st_size)
+		changed |= DATA_CHANGED;
+
+	if (trust_ctime && ce->ce_stat_crc != 0 && !match_stat_crc(st, ce->ce_stat_crc)) {
+		changed |= OWNER_CHANGED;
+		changed |= INODE_CHANGED;
+	}
+	/* Racily smudged entry? */
+	if (ce->ce_flags & CE_SMUDGED) {
+		if (!changed && !is_empty_blob_sha1(ce->sha1) && ce_modified_check_fs(ce, st))
+			changed |= DATA_CHANGED;
+	}
+	return changed;
+}
+
+static int verify_hdr(void *mmap, unsigned long size)
+{
+	uint32_t *filecrc;
+	unsigned int header_size;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v5;
+
+	if (size < sizeof(struct cache_version_header)
+			+ sizeof (struct cache_header) + 4)
+		die("index file smaller than expected");
+
+	hdr = mmap;
+	hdr_v5 = ptr_add(mmap, sizeof(*hdr));
+	/* Size of the header + the size of the extension offsets */
+	header_size = sizeof(*hdr) + sizeof(*hdr_v5) + hdr_v5->hdr_nextension * 4;
+	/* Initialize crc */
+	filecrc = ptr_add(mmap, header_size);
+	if (!check_crc32(0, hdr, header_size, ntohl(*filecrc)))
+		return error("bad index file header crc signature");
+	return 0;
+}
+
+static struct cache_entry *cache_entry_from_ondisk(struct ondisk_cache_entry *ondisk,
+						   struct directory_entry *de,
+						   char *name,
+						   size_t len,
+						   size_t prefix_len)
+{
+	struct cache_entry *ce = xmalloc(cache_entry_size(len + de->de_pathlen));
+	int flags;
+
+	flags = ntoh_s(ondisk->flags);
+	ce->ce_stat_data.sd_ctime.sec  = 0;
+	ce->ce_stat_data.sd_mtime.sec  = ntoh_l(ondisk->mtime.sec);
+	ce->ce_stat_data.sd_ctime.nsec = 0;
+	ce->ce_stat_data.sd_mtime.nsec = ntoh_l(ondisk->mtime.nsec);
+	ce->ce_stat_data.sd_dev        = 0;
+	ce->ce_stat_data.sd_ino        = 0;
+	ce->ce_stat_data.sd_uid        = 0;
+	ce->ce_stat_data.sd_gid        = 0;
+	ce->ce_stat_data.sd_size       = ntoh_l(ondisk->size);
+	ce->ce_mode       = ntoh_s(ondisk->mode);
+	ce->ce_flags      = flags & CE_STAGEMASK;
+	ce->ce_flags     |= flags & CE_VALID;
+	ce->ce_flags     |= flags & CE_SMUDGED;
+	if (flags & CE_INTENT_TO_ADD_V5)
+		ce->ce_flags |= CE_INTENT_TO_ADD;
+	if (flags & CE_SKIP_WORKTREE_V5)
+		ce->ce_flags |= CE_SKIP_WORKTREE;
+	ce->ce_stat_crc   = ntoh_l(ondisk->stat_crc);
+	ce->ce_namelen    = len + de->de_pathlen;
+	hashcpy(ce->sha1, ondisk->sha1);
+	memcpy(ce->name, de->pathname, de->de_pathlen);
+	memcpy(ce->name + de->de_pathlen, name, len);
+	ce->name[len + de->de_pathlen] = '\0';
+	ce->next_ce = NULL;
+	return ce;
+}
+
+static struct directory_entry *directory_entry_from_ondisk(struct ondisk_directory_entry *ondisk,
+						   const char *name,
+						   size_t len)
+{
+	struct directory_entry *de = xmalloc(directory_entry_size(len));
+
+
+	memcpy(de->pathname, name, len);
+	de->pathname[len] = '\0';
+	de->de_flags      = ntoh_s(ondisk->flags);
+	de->de_foffset    = ntoh_l(ondisk->foffset);
+	de->de_cr         = ntoh_l(ondisk->cr);
+	de->de_ncr        = ntoh_l(ondisk->ncr);
+	de->de_nsubtrees  = ntoh_l(ondisk->nsubtrees);
+	de->de_nfiles     = ntoh_l(ondisk->nfiles);
+	de->de_nentries   = ntoh_l(ondisk->nentries);
+	de->de_pathlen    = len;
+	hashcpy(de->sha1, ondisk->sha1);
+	return de;
+}
+
+static struct conflict_part *conflict_part_from_ondisk(struct ondisk_conflict_part *ondisk)
+{
+	struct conflict_part *cp = xmalloc(sizeof(struct conflict_part));
+
+	cp->flags      = ntoh_s(ondisk->flags);
+	cp->entry_mode = ntoh_s(ondisk->entry_mode);
+	hashcpy(cp->sha1, ondisk->sha1);
+	return cp;
+}
+
+static struct cache_entry *convert_conflict_part(struct conflict_part *cp,
+						char * name,
+						unsigned int len)
+{
+
+	struct cache_entry *ce = xmalloc(cache_entry_size(len));
+
+	ce->ce_stat_data.sd_ctime.sec  = 0;
+	ce->ce_stat_data.sd_mtime.sec  = 0;
+	ce->ce_stat_data.sd_ctime.nsec = 0;
+	ce->ce_stat_data.sd_mtime.nsec = 0;
+	ce->ce_stat_data.sd_dev        = 0;
+	ce->ce_stat_data.sd_ino        = 0;
+	ce->ce_stat_data.sd_uid        = 0;
+	ce->ce_stat_data.sd_gid        = 0;
+	ce->ce_stat_data.sd_size       = 0;
+	ce->ce_mode       = cp->entry_mode;
+	ce->ce_flags      = conflict_stage(cp) << CE_STAGESHIFT;
+	ce->ce_stat_crc   = 0;
+	ce->ce_namelen    = len;
+	hashcpy(ce->sha1, cp->sha1);
+	memcpy(ce->name, name, len);
+	ce->name[len] = '\0';
+	return ce;
+}
+
+static struct directory_entry *read_directories(unsigned int *dir_offset,
+				unsigned int *dir_table_offset,
+				void *mmap,
+				int mmap_size)
+{
+	int i, ondisk_directory_size;
+	uint32_t *filecrc, *beginning, *end;
+	struct directory_entry *current = NULL;
+	struct ondisk_directory_entry *disk_de;
+	struct directory_entry *de;
+	unsigned int data_len, len;
+	char *name;
+
+	/* Length of pathname + nul byte for termination + size of
+	 * members of ondisk_directory_entry. (Just using the size
+	 * of the struct doesn't work, because there may be padding
+	 * bytes for the struct)
+	 */
+	ondisk_directory_size = sizeof(disk_de->flags)
+		+ sizeof(disk_de->foffset)
+		+ sizeof(disk_de->cr)
+		+ sizeof(disk_de->ncr)
+		+ sizeof(disk_de->nsubtrees)
+		+ sizeof(disk_de->nfiles)
+		+ sizeof(disk_de->nentries)
+		+ sizeof(disk_de->sha1);
+	name = ptr_add(mmap, *dir_offset);
+	beginning = ptr_add(mmap, *dir_table_offset);
+	end = ptr_add(mmap, *dir_table_offset + 4);
+	len = ntoh_l(*end) - ntoh_l(*beginning) - ondisk_directory_size - 5;
+	disk_de = ptr_add(mmap, *dir_offset + len + 1);
+	de = directory_entry_from_ondisk(disk_de, name, len);
+	de->next = NULL;
+
+	data_len = len + 1 + ondisk_directory_size;
+	filecrc = ptr_add(mmap, *dir_offset + data_len);
+	if (!check_crc32(0, ptr_add(mmap, *dir_offset), data_len, ntoh_l(*filecrc)))
+		goto unmap;
+
+	*dir_table_offset += 4;
+	*dir_offset += data_len + 4; /* crc code */
+
+	current = de;
+	for (i = 0; i < de->de_nsubtrees; i++) {
+		current->next = read_directories(dir_offset, dir_table_offset,
+						mmap, mmap_size);
+		while (current->next)
+			current = current->next;
+	}
+
+	return de;
+unmap:
+	munmap(mmap, mmap_size);
+	die("directory crc doesn't match for '%s'", de->pathname);
+}
+
+static int read_entry(struct cache_entry **ce, struct directory_entry *de,
+		      unsigned int *entry_offset,
+		      void **mmap, unsigned long mmap_size,
+		      unsigned int *foffsetblock)
+{
+	int len, offset_to_offset;
+	char *name;
+	uint32_t foffsetblockcrc;
+	uint32_t *filecrc, *beginning, *end;
+	struct ondisk_cache_entry *disk_ce;
+
+	name = ptr_add(*mmap, *entry_offset);
+	beginning = ptr_add(*mmap, *foffsetblock);
+	end = ptr_add(*mmap, *foffsetblock + 4);
+	len = ntoh_l(*end) - ntoh_l(*beginning) - sizeof(struct ondisk_cache_entry) - 5;
+	disk_ce = ptr_add(*mmap, *entry_offset + len + 1);
+	*ce = cache_entry_from_ondisk(disk_ce, de, name, len, de->de_pathlen);
+	filecrc = ptr_add(*mmap, *entry_offset + len + 1 + sizeof(*disk_ce));
+	offset_to_offset = htonl(*foffsetblock);
+	foffsetblockcrc = crc32(0, (Bytef*)&offset_to_offset, 4);
+	if (!check_crc32(foffsetblockcrc,
+		ptr_add(*mmap, *entry_offset), len + 1 + sizeof(*disk_ce),
+		ntoh_l(*filecrc)))
+		return -1;
+
+	*entry_offset += len + 1 + sizeof(*disk_ce) + 4;
+	return 0;
+}
+
+static void ce_queue_push(struct cache_entry **head,
+			     struct cache_entry **tail,
+			     struct cache_entry *ce)
+{
+	if (!*head) {
+		*head = *tail = ce;
+		(*tail)->next = NULL;
+		return;
+	}
+
+	(*tail)->next = ce;
+	ce->next = NULL;
+	*tail = (*tail)->next;
+}
+
+static void conflict_entry_push(struct conflict_entry **head,
+				struct conflict_entry **tail,
+				struct conflict_entry *conflict_entry)
+{
+	if (!*head) {
+		*head = *tail = conflict_entry;
+		(*tail)->next = NULL;
+		return;
+	}
+
+	(*tail)->next = conflict_entry;
+	conflict_entry->next = NULL;
+	*tail = (*tail)->next;
+}
+
+static struct cache_entry *ce_queue_pop(struct cache_entry **head)
+{
+	struct cache_entry *ce;
+
+	ce = *head;
+	*head = (*head)->next;
+	return ce;
+}
+
+static void conflict_part_head_remove(struct conflict_part **head)
+{
+	struct conflict_part *to_free;
+
+	to_free = *head;
+	*head = (*head)->next;
+	free(to_free);
+}
+
+static void conflict_entry_head_remove(struct conflict_entry **head)
+{
+	struct conflict_entry *to_free;
+
+	to_free = *head;
+	*head = (*head)->next;
+	free(to_free);
+}
+
+struct conflict_entry *create_new_conflict(char *name, int len, int pathlen)
+{
+	struct conflict_entry *conflict_entry;
+
+	if (pathlen)
+		pathlen++;
+	conflict_entry = xmalloc(conflict_entry_size(len));
+	conflict_entry->entries = NULL;
+	conflict_entry->nfileconflicts = 0;
+	conflict_entry->namelen = len;
+	memcpy(conflict_entry->name, name, len);
+	conflict_entry->name[len] = '\0';
+	conflict_entry->pathlen = pathlen;
+	conflict_entry->next = NULL;
+
+	return conflict_entry;
+}
+
+void add_part_to_conflict_entry(struct directory_entry *de,
+					struct conflict_entry *entry,
+					struct conflict_part *conflict_part)
+{
+
+	struct conflict_part *conflict_search;
+
+	entry->nfileconflicts++;
+	de->conflict_size += sizeof(struct ondisk_conflict_part);
+	if (!entry->entries)
+		entry->entries = conflict_part;
+	else {
+		conflict_search = entry->entries;
+		while (conflict_search->next)
+			conflict_search = conflict_search->next;
+		conflict_search->next = conflict_part;
+	}
+}
+
+static int read_conflicts(struct conflict_entry **head,
+			  struct directory_entry *de,
+			  void **mmap, unsigned long mmap_size)
+{
+	struct conflict_entry *tail;
+	unsigned int croffset, i;
+	char *full_name;
+
+	croffset = de->de_cr;
+	tail = NULL;
+	for (i = 0; i < de->de_ncr; i++) {
+		struct conflict_entry *conflict_new;
+		unsigned int len, *nfileconflicts;
+		char *name;
+		void *crc_start;
+		int k, offset;
+		uint32_t *filecrc;
+
+		offset = croffset;
+		crc_start = ptr_add(*mmap, offset);
+		name = ptr_add(*mmap, offset);
+		len = strlen(name);
+		offset += len + 1;
+		nfileconflicts = ptr_add(*mmap, offset);
+		offset += 4;
+
+		full_name = xmalloc(sizeof(char) * (len + de->de_pathlen));
+		memcpy(full_name, de->pathname, de->de_pathlen);
+		memcpy(full_name + de->de_pathlen, name, len);
+		conflict_new = create_new_conflict(full_name,
+				len + de->de_pathlen, de->de_pathlen);
+		for (k = 0; k < ntoh_l(*nfileconflicts); k++) {
+			struct ondisk_conflict_part *ondisk;
+			struct conflict_part *cp;
+
+			ondisk = ptr_add(*mmap, offset);
+			cp = conflict_part_from_ondisk(ondisk);
+			cp->next = NULL;
+			add_part_to_conflict_entry(de, conflict_new, cp);
+			offset += sizeof(struct ondisk_conflict_part);
+		}
+		filecrc = ptr_add(*mmap, offset);
+		free(full_name);
+		if (!check_crc32(0, crc_start,
+			len + 1 + 4 + conflict_new->nfileconflicts
+			* sizeof(struct ondisk_conflict_part),
+			ntoh_l(*filecrc)))
+			return -1;
+		croffset = offset + 4;
+		conflict_entry_push(head, &tail, conflict_new);
+	}
+	return 0;
+}
+
+static int read_entries(struct index_state *istate, struct directory_entry **de,
+			unsigned int *entry_offset, void **mmap,
+			unsigned long mmap_size, unsigned int *nr,
+			unsigned int *foffsetblock, struct cache_entry **prev)
+{
+	struct cache_entry *head = NULL, *tail = NULL;
+	struct conflict_entry *conflict_queue;
+	struct cache_entry *ce;
+	int i;
+
+	conflict_queue = NULL;
+	if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
+		return -1;
+	for (i = 0; i < (*de)->de_nfiles; i++) {
+		if (read_entry(&ce,
+			       *de,
+			       entry_offset,
+			       mmap,
+			       mmap_size,
+			       foffsetblock) < 0)
+			return -1;
+		ce_queue_push(&head, &tail, ce);
+		*foffsetblock += 4;
+
+		/*
+		 * Add the conflicted entries at the end of the index file
+		 * to the in-memory format
+		 */
+		if (conflict_queue &&
+		    (conflict_queue->entries->flags & CONFLICT_CONFLICTED) != 0 &&
+		    !cache_name_compare(conflict_queue->name, conflict_queue->namelen,
+					ce->name, ce_namelen(ce))) {
+			struct conflict_part *cp;
+			cp = conflict_queue->entries;
+			cp = cp->next;
+			while (cp) {
+				ce = convert_conflict_part(cp,
+						conflict_queue->name,
+						conflict_queue->namelen);
+				ce_queue_push(&head, &tail, ce);
+				conflict_part_head_remove(&cp);
+			}
+			conflict_entry_head_remove(&conflict_queue);
+		}
+	}
+
+	*de = (*de)->next;
+
+	while (head) {
+		if (*de != NULL
+		    && strcmp(head->name, (*de)->pathname) > 0) {
+			read_entries(istate,
+				     de,
+				     entry_offset,
+				     mmap,
+				     mmap_size,
+				     nr,
+				     foffsetblock,
+				     prev);
+		} else {
+			ce = ce_queue_pop(&head);
+			set_index_entry(istate, *nr, ce);
+			if (*prev)
+				(*prev)->next_ce = ce;
+			(*nr)++;
+			*prev = ce;
+			ce->next = NULL;
+		}
+	}
+	return 0;
+}
+
+static struct directory_entry *read_head_directories(struct index_state *istate,
+						     unsigned int *entry_offset,
+						     unsigned int *foffsetblock,
+						     unsigned int *ndirs,
+						     void *mmap, unsigned long mmap_size)
+{
+	unsigned int dir_offset, dir_table_offset;
+	struct cache_version_header *hdr;
+	struct cache_header *hdr_v5;
+	struct directory_entry *root_directory;
+
+	hdr = mmap;
+	hdr_v5 = ptr_add(mmap, sizeof(*hdr));
+	istate->version = ntohl(hdr->hdr_version);
+	istate->cache_alloc = alloc_nr(ntohl(hdr_v5->hdr_nfile));
+	istate->cache = xcalloc(istate->cache_alloc, sizeof(struct cache_entry *));
+	istate->initialized = 1;
+
+	/* Skip size of the header + crc sum + size of offsets */
+	dir_offset = sizeof(*hdr) + sizeof(*hdr_v5) + 4 + (ntohl(hdr_v5->hdr_ndir) + 1) * 4;
+	dir_table_offset = sizeof(*hdr) + sizeof(*hdr_v5) + 4;
+	root_directory = read_directories(&dir_offset, &dir_table_offset,
+					  mmap, mmap_size);
+
+	*entry_offset = ntohl(hdr_v5->hdr_fblockoffset);
+	*foffsetblock = dir_offset;
+	*ndirs = ntohl(hdr_v5->hdr_ndir);
+	return root_directory;
+}
+
+static int read_index_filtered_v5(struct index_state *istate, void *mmap,
+				  unsigned long mmap_size, struct filter_opts *opts)
+{
+	unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
+	struct directory_entry *root_directory, *de;
+	int i, n;
+	const char **adjusted_pathspec = NULL;
+	int need_root = 1;
+	char *seen, *oldpath;
+	struct cache_entry *prev = NULL;
+
+	root_directory = read_head_directories(istate, &entry_offset,
+					       &foffsetblock, &ndirs,
+					       mmap, mmap_size);
+
+	if (opts && opts->pathspec) {
+		need_root = 0;
+		seen = xcalloc(1, ndirs);
+		for (de = root_directory; de; de = de->next)
+			match_pathspec(opts->pathspec, de->pathname, de->de_pathlen, 0, seen);
+		for (n = 0; opts->pathspec[n]; n++)
+			/* just count */;
+		adjusted_pathspec = xmalloc((n+1)*sizeof(char *));
+		adjusted_pathspec[n] = NULL;
+		for (i = 0; i < n; i++) {
+			if (seen[i] == MATCHED_EXACTLY)
+				adjusted_pathspec[i] = opts->pathspec[i];
+			else {
+				char *super = strdup(opts->pathspec[i]);
+				int len = strlen(super);
+				while (len && super[len - 1] == '/')
+					super[--len] = '\0'; /* strip trailing / */
+				while (len && super[--len] != '/')
+					; /* scan backwards to next / */
+				if (len >= 0)
+					super[len--] = '\0';
+				if (len <= 0) {
+					need_root = 1;
+					break;
+				}
+				adjusted_pathspec[i] = super;
+			}
+		}
+	}
+
+	de = root_directory;
+	while (de) {
+		if (need_root ||
+		    match_pathspec(adjusted_pathspec, de->pathname, de->de_pathlen, 0, NULL)) {
+			unsigned int subdir_foffsetblock = de->de_foffset + foffsetblock;
+			unsigned int *off = mmap + subdir_foffsetblock;
+			unsigned int subdir_entry_offset = entry_offset + ntoh_l(*off);
+			oldpath = de->pathname;
+			do {
+				if (read_entries(istate, &de, &subdir_entry_offset,
+						 &mmap, mmap_size, &nr,
+						 &subdir_foffsetblock, &prev) < 0)
+					return -1;
+			} while (de && !prefixcmp(de->pathname, oldpath));
+		} else
+			de = de->next;
+	}
+	istate->cache_nr = nr;
+	istate->partially_read = 1;
+	return 0;
+}
+
+static int read_index_v5(struct index_state *istate, void *mmap,
+			 unsigned long mmap_size, struct filter_opts *opts)
+{
+	unsigned int entry_offset, ndirs, foffsetblock, nr = 0;
+	struct directory_entry *root_directory, *de;
+	struct cache_entry *prev = NULL;
+
+	if (opts != NULL)
+		return read_index_filtered_v5(istate, mmap, mmap_size, opts);
+
+	root_directory = read_head_directories(istate, &entry_offset,
+					       &foffsetblock, &ndirs,
+					       mmap, mmap_size);
+	de = root_directory;
+	while (de)
+		if (read_entries(istate, &de, &entry_offset, &mmap,
+				 mmap_size, &nr, &foffsetblock, &prev) < 0)
+			return -1;
+	istate->cache_nr = nr;
+	istate->partially_read = 0;
+	return 0;
+}
+
+static void index_change_filter_opts_v5(struct index_state *istate, struct filter_opts *opts)
+{
+	if (istate->initialized == 1 &&
+	    (((istate->filter_opts == NULL || opts == NULL) && istate->filter_opts != opts)
+	     || (!memcmp(istate->filter_opts, opts, sizeof(*opts)))))
+		return;
+	discard_index(istate);
+	read_index_filtered(istate, opts);
+}
+
+struct index_ops v5_ops = {
+	match_stat_basic,
+	verify_hdr,
+	read_index_v5,
+	NULL,
+	index_change_filter_opts_v5
+};
diff --git a/read-cache.h b/read-cache.h
index ce9b79c..fe53c8e 100644
--- a/read-cache.h
+++ b/read-cache.h
@@ -39,6 +39,7 @@ struct internal_ops {
 
 extern struct index_ops v2_ops;
 extern struct internal_ops v2_internal_ops;
+extern struct index_ops v5_ops;
 
 #ifndef NEEDS_ALIGNED_ACCESS
 #define ntoh_s(var) ntohs(var)
-- 
1.8.3.453.g1dfc63d


* [PATCH 16/22] read-cache: read resolve-undo data
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (14 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 15/22] read-cache: read index-v5 Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 17/22] read-cache: read cache-tree in index-v5 Thomas Gummerer
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Make git read the resolve-undo data from the index.

Since the resolve-undo data is joined with the conflicts in
the on-disk format of the index file version 5, the conflict and
resolve-undo data are read at the same time, and the resolve-undo
data is then converted to the in-memory format.
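The conversion keys everything off the 2-bit stage stored in each
conflict part's flags; resolve_undo_convert_v5() then uses stage - 1 as
the index into the mode/sha1 arrays.  A sketch of that packing
(constants as this series defines them; make_conflict_flags() is a
made-up helper showing the inverse operation, and conflict_stage() is
adapted here to take the flag word directly rather than a struct):

```c
#include <assert.h>

#define CONFLICT_CONFLICTED (0x8000)
#define CONFLICT_STAGEMASK  (0x6000)
#define CONFLICT_STAGESHIFT 13

/* extract the stage (1..3) from a conflict flag word */
#define conflict_stage(flags) \
	((CONFLICT_STAGEMASK & (flags)) >> CONFLICT_STAGESHIFT)

/* pack a stage (1..3) and the conflicted bit into a flag word */
static unsigned short make_conflict_flags(int stage, int conflicted)
{
	unsigned short flags = (unsigned short)(stage << CONFLICT_STAGESHIFT);
	if (conflicted)
		flags |= CONFLICT_CONFLICTED;
	return flags;
}
```

Entries without CONFLICT_CONFLICTED set are the resolved ones that end
up in the resolve-undo string list.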

Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache-v5.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/read-cache-v5.c b/read-cache-v5.c
index e319f30..193970a 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "read-cache.h"
+#include "string-list.h"
 #include "resolve-undo.h"
 #include "cache-tree.h"
 #include "dir.h"
@@ -447,6 +448,43 @@ static int read_conflicts(struct conflict_entry **head,
 	return 0;
 }
 
+static void resolve_undo_convert_v5(struct index_state *istate,
+				    struct conflict_entry *conflict)
+{
+	int i;
+
+	while (conflict) {
+		struct string_list_item *lost;
+		struct resolve_undo_info *ui;
+		struct conflict_part *cp;
+
+		if (conflict->entries &&
+		    (conflict->entries->flags & CONFLICT_CONFLICTED) != 0) {
+			conflict = conflict->next;
+			continue;
+		}
+		if (!istate->resolve_undo) {
+			istate->resolve_undo = xcalloc(1, sizeof(struct string_list));
+			istate->resolve_undo->strdup_strings = 1;
+		}
+
+		lost = string_list_insert(istate->resolve_undo, conflict->name);
+		if (!lost->util)
+			lost->util = xcalloc(1, sizeof(*ui));
+		ui = lost->util;
+
+		cp = conflict->entries;
+		for (i = 0; i < 3; i++)
+			ui->mode[i] = 0;
+		while (cp) {
+			ui->mode[conflict_stage(cp) - 1] = cp->entry_mode;
+			hashcpy(ui->sha1[conflict_stage(cp) - 1], cp->sha1);
+			cp = cp->next;
+		}
+		conflict = conflict->next;
+	}
+}
+
 static int read_entries(struct index_state *istate, struct directory_entry **de,
 			unsigned int *entry_offset, void **mmap,
 			unsigned long mmap_size, unsigned int *nr,
@@ -460,6 +498,7 @@ static int read_entries(struct index_state *istate, struct directory_entry **de,
 	conflict_queue = NULL;
 	if (read_conflicts(&conflict_queue, *de, mmap, mmap_size) < 0)
 		return -1;
+	resolve_undo_convert_v5(istate, conflict_queue);
 	for (i = 0; i < (*de)->de_nfiles; i++) {
 		if (read_entry(&ce,
 			       *de,
-- 
1.8.3.453.g1dfc63d


* [PATCH 17/22] read-cache: read cache-tree in index-v5
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (15 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 16/22] read-cache: read resolve-undo data Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07 20:41   ` Eric Sunshine
  2013-07-07  8:11 ` [PATCH 18/22] read-cache: write index-v5 Thomas Gummerer
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Since the cache-tree data is saved as part of the directory data,
it is already read at the beginning of the index; it only has to
be converted from this directory data.

The in-memory cache-tree is arranged as a tree, with the children
at each node sorted by path length, while the on-disk format is
sorted lexically, so the in-memory format has to be rebuilt from
the on-disk directory list.
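The ordering the in-memory cache-tree expects is length-first, with a
byte comparison only as tie-breaker; this is what subtree_name_cmp()
in cache-tree.c implements and what this patch exports.  A
self-contained copy of that comparison, for illustration:

```c
#include <assert.h>
#include <string.h>

/* compare subtree names by length first, then byte-wise */
static int subtree_name_cmp(const char *one, int onelen,
			    const char *two, int twolen)
{
	if (onelen < twolen)
		return -1;
	if (twolen < onelen)
		return 1;
	return memcmp(one, two, onelen);
}
```

Under this ordering a shorter name always sorts before a longer one
("z" before "ab"), which is why the lexically sorted on-disk directory
list has to be re-sorted via qsort() in sort_directories().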

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache-tree.c    |   2 +-
 cache-tree.h    |   6 ++++
 read-cache-v5.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 37e4d00..f4b0917 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -31,7 +31,7 @@ void cache_tree_free(struct cache_tree **it_p)
 	*it_p = NULL;
 }
 
-static int subtree_name_cmp(const char *one, int onelen,
+int subtree_name_cmp(const char *one, int onelen,
 			    const char *two, int twolen)
 {
 	if (onelen < twolen)
diff --git a/cache-tree.h b/cache-tree.h
index 55d0f59..9aac493 100644
--- a/cache-tree.h
+++ b/cache-tree.h
@@ -21,10 +21,16 @@ struct cache_tree {
 	struct cache_tree_sub **down;
 };
 
+struct directory_queue {
+	struct directory_queue *down;
+	struct directory_entry *de;
+};
+
 struct cache_tree *cache_tree(void);
 void cache_tree_free(struct cache_tree **);
 void cache_tree_invalidate_path(struct cache_tree *, const char *);
 struct cache_tree_sub *cache_tree_sub(struct cache_tree *, const char *);
+int subtree_name_cmp(const char *, int, const char *, int);
 
 void cache_tree_write(struct strbuf *, struct cache_tree *root);
 struct cache_tree *cache_tree_read(const char *buffer, unsigned long size);
diff --git a/read-cache-v5.c b/read-cache-v5.c
index 193970a..f1ad132 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -448,6 +448,103 @@ static int read_conflicts(struct conflict_entry **head,
 	return 0;
 }
 
+static struct cache_tree *convert_one(struct directory_queue *queue, int dirnr)
+{
+	int i, subtree_nr;
+	struct cache_tree *it;
+	struct directory_queue *down;
+
+	it = cache_tree();
+	it->entry_count = queue[dirnr].de->de_nentries;
+	subtree_nr = queue[dirnr].de->de_nsubtrees;
+	if (0 <= it->entry_count)
+		hashcpy(it->sha1, queue[dirnr].de->sha1);
+
+	/*
+	 * Just a heuristic -- we do not add directories that often but
+	 * we do not want to have to extend it immediately when we do,
+	 * hence +2.
+	 */
+	it->subtree_alloc = subtree_nr + 2;
+	it->down = xcalloc(it->subtree_alloc, sizeof(struct cache_tree_sub *));
+	down = queue[dirnr].down;
+	for (i = 0; i < subtree_nr; i++) {
+		struct cache_tree *sub;
+		struct cache_tree_sub *subtree;
+		char *buf, *name;
+
+		name = "";
+		buf = strtok(down[i].de->pathname, "/");
+		while (buf) {
+			name = buf;
+			buf = strtok(NULL, "/");
+		}
+		sub = convert_one(down, i);
+		if (!sub)
+			goto free_return;
+		subtree = cache_tree_sub(it, name);
+		subtree->cache_tree = sub;
+	}
+	if (subtree_nr != it->subtree_nr)
+		die("cache-tree: internal error");
+	return it;
+ free_return:
+	cache_tree_free(&it);
+	return NULL;
+}
+
+static int compare_cache_tree_elements(const void *a, const void *b)
+{
+	const struct directory_entry *de1, *de2;
+
+	de1 = ((const struct directory_queue *)a)->de;
+	de2 = ((const struct directory_queue *)b)->de;
+	return subtree_name_cmp(de1->pathname, de1->de_pathlen,
+				de2->pathname, de2->de_pathlen);
+}
+
+static struct directory_entry *sort_directories(struct directory_entry *de,
+						struct directory_queue *queue)
+{
+	int i, nsubtrees;
+
+	nsubtrees = de->de_nsubtrees;
+	for (i = 0; i < nsubtrees; i++) {
+		struct directory_entry *new_de;
+		de = de->next;
+		new_de = xmalloc(directory_entry_size(de->de_pathlen));
+		memcpy(new_de, de, directory_entry_size(de->de_pathlen));
+		queue[i].de = new_de;
+		if (de->de_nsubtrees) {
+			queue[i].down = xcalloc(de->de_nsubtrees,
+					sizeof(struct directory_queue));
+			de = sort_directories(de,
+					queue[i].down);
+		}
+	}
+	qsort(queue, nsubtrees, sizeof(struct directory_queue),
+			compare_cache_tree_elements);
+	return de;
+}
+
+/*
+ * This function modifies the directory argument that is given to it.
+ * Don't use it if the directory entries are still needed after.
+ */
+static struct cache_tree *cache_tree_convert_v5(struct directory_entry *de)
+{
+	struct directory_queue *queue;
+
+	if (!de->de_nentries)
+		return NULL;
+	queue = xcalloc(1, sizeof(struct directory_queue));
+	queue[0].de = de;
+	queue[0].down = xcalloc(de->de_nsubtrees, sizeof(struct directory_queue));
+
+	sort_directories(de, queue[0].down);
+	return convert_one(queue, 0);
+}
+
 static void resolve_undo_convert_v5(struct index_state *istate,
 				    struct conflict_entry *conflict)
 {
@@ -650,6 +747,7 @@ static int read_index_filtered_v5(struct index_state *istate, void *mmap,
 		} else
 			de = de->next;
 	}
+	istate->cache_tree = cache_tree_convert_v5(root_directory);
 	istate->cache_nr = nr;
 	istate->partially_read = 1;
 	return 0;
@@ -673,6 +771,8 @@ static int read_index_v5(struct index_state *istate, void *mmap,
 		if (read_entries(istate, &de, &entry_offset, &mmap,
 				 mmap_size, &nr, &foffsetblock, &prev) < 0)
 			return -1;
+
+	istate->cache_tree = cache_tree_convert_v5(root_directory);
 	istate->cache_nr = nr;
 	istate->partially_read = 0;
 	return 0;
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 18/22] read-cache: write index-v5
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (16 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 17/22] read-cache: read cache-tree in index-v5 Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07 20:43   ` Eric Sunshine
  2013-07-07  8:11 ` [PATCH 19/22] read-cache: write index-v5 cache-tree data Thomas Gummerer
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Write the index version 5 file format to disk. This version does not
yet write the cache-tree data and resolve-undo data to the file.

The main work is done when compiling the directory entries from the
current in-memory format; the conflicts and the file data are
calculated in the same pass.

Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 cache.h         |   8 +
 read-cache-v5.c | 594 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 read-cache.c    |  11 +-
 read-cache.h    |   1 +
 4 files changed, 611 insertions(+), 3 deletions(-)
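The patch funnels all index output through a small ce_write()/ce_write_flush()
pair that batches data into a fixed-size buffer before hitting the file
descriptor. A stand-alone sketch of just that buffering pattern, writing into
a memory sink instead of an fd (the sink and the tiny buffer size are invented
for illustration):

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 8		/* tiny on purpose, to force flushes */

static char sink[64];		/* stands in for the index file */
static unsigned int sink_len;
static char buffer[BUF_SIZE];
static unsigned int buffer_len;

static void buf_flush(void)
{
	memcpy(sink + sink_len, buffer, buffer_len);
	sink_len += buffer_len;
	buffer_len = 0;
}

/* Copy data into the buffer, flushing whenever it fills up,
 * like ce_write() does via write_in_full(). */
static void buf_write(const char *data, unsigned int len)
{
	while (len) {
		unsigned int partial = BUF_SIZE - buffer_len;
		if (partial > len)
			partial = len;
		memcpy(buffer + buffer_len, data, partial);
		buffer_len += partial;
		if (buffer_len == BUF_SIZE)
			buf_flush();
		data += partial;
		len -= partial;
	}
}
```

The real ce_write() additionally threads a CRC-32 through every chunk; only
the batching is shown here.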

diff --git a/cache.h b/cache.h
index e110ec8..a92b490 100644
--- a/cache.h
+++ b/cache.h
@@ -581,6 +581,7 @@ extern int unmerged_index(const struct index_state *);
 extern int verify_path(const char *path);
 extern struct cache_entry *index_name_exists(struct index_state *istate, const char *name, int namelen, int igncase);
 extern int index_name_pos(const struct index_state *, const char *name, int namelen);
+extern struct directory_entry *init_directory_entry(char *pathname, int len);
 #define ADD_CACHE_OK_TO_ADD 1		/* Ok to add */
 #define ADD_CACHE_OK_TO_REPLACE 2	/* Ok to replace file/directory */
 #define ADD_CACHE_SKIP_DFCHECK 4	/* Ok to skip DF conflict checks */
@@ -1379,6 +1380,13 @@ static inline ssize_t write_str_in_full(int fd, const char *str)
 	return write_in_full(fd, str, strlen(str));
 }
 
+/* index-v5 helper functions */
+extern char *super_directory(const char *filename);
+extern void insert_directory_entry(struct directory_entry *, struct hash_table *, int *, unsigned int *, uint32_t);
+extern void add_conflict_to_directory_entry(struct directory_entry *, struct conflict_entry *);
+extern void add_part_to_conflict_entry(struct directory_entry *, struct conflict_entry *, struct conflict_part *);
+extern struct conflict_entry *create_new_conflict(char *, int, int);
+
 /* pager.c */
 extern void setup_pager(void);
 extern const char *pager_program;
diff --git a/read-cache-v5.c b/read-cache-v5.c
index f1ad132..f056f6b 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -788,10 +788,602 @@ static void index_change_filter_opts_v5(struct index_state *istate, struct filte
 	read_index_filtered(istate, opts);
 }
 
+#define WRITE_BUFFER_SIZE 8192
+static unsigned char write_buffer[WRITE_BUFFER_SIZE];
+static unsigned long write_buffer_len;
+
+static int ce_write_flush(int fd)
+{
+	unsigned int buffered = write_buffer_len;
+	if (buffered) {
+		if (write_in_full(fd, write_buffer, buffered) != buffered)
+			return -1;
+		write_buffer_len = 0;
+	}
+	return 0;
+}
+
+static int ce_write(uint32_t *crc, int fd, void *data, unsigned int len)
+{
+	if (crc)
+		*crc = crc32(*crc, (Bytef*)data, len);
+	while (len) {
+		unsigned int buffered = write_buffer_len;
+		unsigned int partial = WRITE_BUFFER_SIZE - buffered;
+		if (partial > len)
+			partial = len;
+		memcpy(write_buffer + buffered, data, partial);
+		buffered += partial;
+		if (buffered == WRITE_BUFFER_SIZE) {
+			write_buffer_len = buffered;
+			if (ce_write_flush(fd))
+				return -1;
+			buffered = 0;
+		}
+		write_buffer_len = buffered;
+		len -= partial;
+		data = (char *) data + partial;
+	}
+	return 0;
+}
+
+static int ce_flush(int fd)
+{
+	unsigned int left = write_buffer_len;
+
+	if (left)
+		write_buffer_len = 0;
+
+	if (write_in_full(fd, write_buffer, left) != left)
+		return -1;
+
+	return 0;
+}
+
+static void ce_smudge_racily_clean_entry(struct cache_entry *ce)
+{
+	/*
+	 * This function must only be called if the timestamp of ce
+	 * is racy (check with is_racy_timestamp). If the timestamp
+	 * is racy, the writer will set the CE_SMUDGED flag.
+	 *
+	 * The reader (match_stat_basic) will then take care
+	 * of checking if the entry is really changed or not, by
+	 * taking into account the size and the stat_crc and if
+	 * that hasn't changed checking the sha1.
+	 */
+	ce->ce_flags |= CE_SMUDGED;
+}
+
+char *super_directory(const char *filename)
+{
+	char *slash;
+
+	slash = strrchr(filename, '/');
+	if (slash)
+		return xmemdupz(filename, slash-filename);
+	return NULL;
+}
+
+struct directory_entry *init_directory_entry(char *pathname, int len)
+{
+	struct directory_entry *de = xmalloc(directory_entry_size(len));
+
+	memcpy(de->pathname, pathname, len);
+	de->pathname[len] = '\0';
+	de->de_flags      = 0;
+	de->de_foffset    = 0;
+	de->de_cr         = 0;
+	de->de_ncr        = 0;
+	de->de_nsubtrees  = 0;
+	de->de_nfiles     = 0;
+	de->de_nentries   = 0;
+	memset(de->sha1, 0, 20);
+	de->de_pathlen    = len;
+	de->next          = NULL;
+	de->next_hash     = NULL;
+	de->ce            = NULL;
+	de->ce_last       = NULL;
+	de->conflict      = NULL;
+	de->conflict_last = NULL;
+	de->conflict_size = 0;
+	return de;
+}
+
+static void ondisk_from_directory_entry(struct directory_entry *de,
+					struct ondisk_directory_entry *ondisk)
+{
+	ondisk->foffset   = htonl(de->de_foffset);
+	ondisk->cr        = htonl(de->de_cr);
+	ondisk->ncr       = htonl(de->de_ncr);
+	ondisk->nsubtrees = htonl(de->de_nsubtrees);
+	ondisk->nfiles    = htonl(de->de_nfiles);
+	ondisk->nentries  = htonl(de->de_nentries);
+	hashcpy(ondisk->sha1, de->sha1);
+	ondisk->flags     = htons(de->de_flags);
+}
+
+static struct conflict_part *conflict_part_from_inmemory(struct cache_entry *ce)
+{
+	struct conflict_part *conflict;
+	int flags;
+
+	conflict = xmalloc(sizeof(struct conflict_part));
+	flags                = CONFLICT_CONFLICTED;
+	flags               |= ce_stage(ce) << CONFLICT_STAGESHIFT;
+	conflict->flags      = flags;
+	conflict->entry_mode = ce->ce_mode;
+	conflict->next       = NULL;
+	hashcpy(conflict->sha1, ce->sha1);
+	return conflict;
+}
+
+static void conflict_to_ondisk(struct conflict_part *cp,
+				struct ondisk_conflict_part *ondisk)
+{
+	ondisk->flags      = htons(cp->flags);
+	ondisk->entry_mode = htons(cp->entry_mode);
+	hashcpy(ondisk->sha1, cp->sha1);
+}
+
+void add_conflict_to_directory_entry(struct directory_entry *de,
+					struct conflict_entry *conflict_entry)
+{
+	de->de_ncr++;
+	de->conflict_size += conflict_entry->namelen + 1 + 8 - conflict_entry->pathlen;
+	conflict_entry_push(&de->conflict, &de->conflict_last, conflict_entry);
+}
+
+void insert_directory_entry(struct directory_entry *de,
+			struct hash_table *table,
+			int *total_dir_len,
+			unsigned int *ndir,
+			uint32_t crc)
+{
+	struct directory_entry *insert;
+
+	insert = (struct directory_entry *)insert_hash(crc, de, table);
+	if (insert) {
+		de->next_hash = insert->next_hash;
+		insert->next_hash = de;
+	}
+	(*ndir)++;
+	if (de->de_pathlen == 0)
+		(*total_dir_len)++;
+	else
+		*total_dir_len += de->de_pathlen + 2;
+}
+
+static struct conflict_entry *create_conflict_entry_from_ce(struct cache_entry *ce,
+								int pathlen)
+{
+	return create_new_conflict(ce->name, ce_namelen(ce), pathlen);
+}
+
+static struct directory_entry *compile_directory_data(struct index_state *istate,
+						int nfile,
+						unsigned int *ndir,
+						int *non_conflicted,
+						int *total_dir_len,
+						int *total_file_len)
+{
+	int i, dir_len = -1;
+	char *dir;
+	struct directory_entry *de, *current, *search, *found, *new, *previous_entry;
+	struct cache_entry **cache = istate->cache;
+	struct conflict_entry *conflict_entry;
+	struct hash_table table;
+	uint32_t crc;
+
+	init_hash(&table);
+	de = init_directory_entry("", 0);
+	current = de;
+	*ndir = 1;
+	*total_dir_len = 1;
+	crc = crc32(0, (Bytef*)de->pathname, de->de_pathlen);
+	insert_hash(crc, de, &table);
+	conflict_entry = NULL;
+	for (i = 0; i < nfile; i++) {
+		int new_entry;
+		if (cache[i]->ce_flags & CE_REMOVE)
+			continue;
+
+		new_entry = !ce_stage(cache[i]) || !conflict_entry
+		    || cache_name_compare(conflict_entry->name, conflict_entry->namelen,
+					cache[i]->name, ce_namelen(cache[i]));
+		if (new_entry)
+			(*non_conflicted)++;
+		if (dir_len < 0 || strncmp(cache[i]->name, dir, dir_len)
+		    || cache[i]->name[dir_len] != '/'
+		    || strchr(cache[i]->name + dir_len + 1, '/')) {
+			dir = super_directory(cache[i]->name);
+			if (!dir)
+				dir_len = 0;
+			else
+				dir_len = strlen(dir);
+			crc = crc32(0, (Bytef*)dir, dir_len);
+			found = lookup_hash(crc, &table);
+			search = found;
+			while (search && dir_len != 0 && strcmp(dir, search->pathname) != 0)
+				search = search->next_hash;
+		}
+		previous_entry = current;
+		if (!search || !found) {
+			new = init_directory_entry(dir, dir_len);
+			current->next = new;
+			current = current->next;
+			insert_directory_entry(new, &table, total_dir_len, ndir, crc);
+			search = current;
+		}
+		if (new_entry) {
+			search->de_nfiles++;
+			*total_file_len += ce_namelen(cache[i]) + 1;
+			if (search->de_pathlen)
+				*total_file_len -= search->de_pathlen + 1;
+			ce_queue_push(&(search->ce), &(search->ce_last), cache[i]);
+		}
+		if (ce_stage(cache[i]) > 0) {
+			struct conflict_part *conflict_part;
+			if (new_entry) {
+				conflict_entry = create_conflict_entry_from_ce(cache[i], search->de_pathlen);
+				add_conflict_to_directory_entry(search, conflict_entry);
+			}
+			conflict_part = conflict_part_from_inmemory(cache[i]);
+			add_part_to_conflict_entry(search, conflict_entry, conflict_part);
+		}
+		if (dir && !found) {
+			struct directory_entry *no_subtrees;
+
+			no_subtrees = current;
+			dir = super_directory(dir);
+			if (dir)
+				dir_len = strlen(dir);
+			else
+				dir_len = 0;
+			crc = crc32(0, (Bytef*)dir, dir_len);
+			found = lookup_hash(crc, &table);
+			while (!found) {
+				new = init_directory_entry(dir, dir_len);
+				new->de_nsubtrees = 1;
+				new->next = no_subtrees;
+				no_subtrees = new;
+				insert_directory_entry(new, &table, total_dir_len, ndir, crc);
+				dir = super_directory(dir);
+				if (!dir)
+					dir_len = 0;
+				else
+					dir_len = strlen(dir);
+				crc = crc32(0, (Bytef*)dir, dir_len);
+				found = lookup_hash(crc, &table);
+			}
+			search = found;
+			while (search->next_hash && strcmp(dir, search->pathname) != 0)
+				search = search->next_hash;
+			if (search)
+				found = search;
+			found->de_nsubtrees++;
+			previous_entry->next = no_subtrees;
+		}
+	}
+	return de;
+}
+
+static void ondisk_from_cache_entry(struct cache_entry *ce,
+				    struct ondisk_cache_entry *ondisk)
+{
+	unsigned int flags;
+
+	flags  = ce->ce_flags & CE_STAGEMASK;
+	flags |= ce->ce_flags & CE_VALID;
+	flags |= ce->ce_flags & CE_SMUDGED;
+	if (ce->ce_flags & CE_INTENT_TO_ADD)
+		flags |= CE_INTENT_TO_ADD_V5;
+	if (ce->ce_flags & CE_SKIP_WORKTREE)
+		flags |= CE_SKIP_WORKTREE_V5;
+	ondisk->flags      = htons(flags);
+	ondisk->mode       = htons(ce->ce_mode);
+	ondisk->mtime.sec  = htonl(ce->ce_stat_data.sd_mtime.sec);
+#ifdef USE_NSEC
+	ondisk->mtime.nsec = htonl(ce->ce_stat_data.sd_mtime.nsec);
+#else
+	ondisk->mtime.nsec = 0;
+#endif
+	ondisk->size       = htonl(ce->ce_stat_data.sd_size);
+	if (!ce->ce_stat_crc)
+		ce->ce_stat_crc = calculate_stat_crc(ce);
+	ondisk->stat_crc   = htonl(ce->ce_stat_crc);
+	hashcpy(ondisk->sha1, ce->sha1);
+}
+
+static int write_directories(struct directory_entry *de, int fd, int conflict_offset)
+{
+	struct directory_entry *current;
+	struct ondisk_directory_entry ondisk;
+	int current_offset, offset_write, ondisk_size, foffset;
+	uint32_t crc;
+
+	/*
+	 * This is needed because the compiler aligns structs to sizes that
+	 * are multiples of 4
+	 */
+	ondisk_size = sizeof(ondisk.flags)
+		+ sizeof(ondisk.foffset)
+		+ sizeof(ondisk.cr)
+		+ sizeof(ondisk.ncr)
+		+ sizeof(ondisk.nsubtrees)
+		+ sizeof(ondisk.nfiles)
+		+ sizeof(ondisk.nentries)
+		+ sizeof(ondisk.sha1);
+	current = de;
+	current_offset = 0;
+	foffset = 0;
+	while (current) {
+		int pathlen;
+
+		offset_write = htonl(current_offset);
+		if (ce_write(NULL, fd, &offset_write, 4) < 0)
+			return -1;
+		if (current->de_pathlen == 0)
+			pathlen = 0;
+		else
+			pathlen = current->de_pathlen + 1;
+		current_offset += pathlen + 1 + ondisk_size + 4;
+		current = current->next;
+	}
+	/*
+	 * Write one more offset, which points to the end of the entries,
+	 * because we use it for calculating the dir length, instead of
+	 * using strlen.
+	 */
+	offset_write = htonl(current_offset);
+	if (ce_write(NULL, fd, &offset_write, 4) < 0)
+		return -1;
+	current = de;
+	while (current) {
+		crc = 0;
+		if (current->de_pathlen == 0) {
+			if (ce_write(&crc, fd, current->pathname, 1) < 0)
+				return -1;
+		} else {
+			char *path;
+			path = xmalloc(sizeof(char) * (current->de_pathlen + 2));
+			memcpy(path, current->pathname, current->de_pathlen);
+			memcpy(path + current->de_pathlen, "/\0", 2);
+			if (ce_write(&crc, fd, path, current->de_pathlen + 2) < 0)
+				return -1;
+		}
+		current->de_foffset = foffset;
+		current->de_cr = conflict_offset;
+		ondisk_from_directory_entry(current, &ondisk);
+		if (ce_write(&crc, fd, &ondisk, ondisk_size) < 0)
+			return -1;
+		crc = htonl(crc);
+		if (ce_write(NULL, fd, &crc, 4) < 0)
+			return -1;
+		conflict_offset += current->conflict_size;
+		foffset += current->de_nfiles * 4;
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_entries(struct index_state *istate,
+			    struct directory_entry *de,
+			    int entries,
+			    int fd,
+			    int offset_to_offset)
+{
+	int offset, offset_write, ondisk_size;
+	struct directory_entry *current;
+
+	offset = 0;
+	ondisk_size = sizeof(struct ondisk_cache_entry);
+	current = de;
+	while (current) {
+		int pathlen;
+		struct cache_entry *ce = current->ce;
+
+		if (current->de_pathlen == 0)
+			pathlen = 0;
+		else
+			pathlen = current->de_pathlen + 1;
+		while (ce) {
+			if (ce->ce_flags & CE_REMOVE) {
+				ce = ce->next;
+				continue;
+			}
+			if (!ce_uptodate(ce) && is_racy_timestamp(istate, ce))
+				ce_smudge_racily_clean_entry(ce);
+			if (is_null_sha1(ce->sha1))
+				return error("cache entry has null sha1: %s", ce->name);
+
+			offset_write = htonl(offset);
+			if (ce_write(NULL, fd, &offset_write, 4) < 0)
+				return -1;
+			offset += ce_namelen(ce) - pathlen + 1 + ondisk_size + 4;
+			ce = ce->next;
+		}
+		current = current->next;
+	}
+	/*
+	 * Write one more offset, which points to the end of the entries,
+	 * because we use it for calculating the file length, instead of
+	 * using strlen.
+	 */
+	offset_write = htonl(offset);
+	if (ce_write(NULL, fd, &offset_write, 4) < 0)
+		return -1;
+
+	offset = offset_to_offset;
+	current = de;
+	while (current) {
+		int pathlen;
+		struct cache_entry *ce = current->ce;
+
+		if (current->de_pathlen == 0)
+			pathlen = 0;
+		else
+			pathlen = current->de_pathlen + 1;
+		while (ce) {
+			struct ondisk_cache_entry ondisk;
+			uint32_t crc, calc_crc;
+
+			if (ce->ce_flags & CE_REMOVE) {
+				ce = ce->next;
+				continue;
+			}
+			calc_crc = htonl(offset);
+			crc = crc32(0, (Bytef*)&calc_crc, 4);
+			if (ce_write(&crc, fd, ce->name + pathlen,
+					ce_namelen(ce) - pathlen + 1) < 0)
+				return -1;
+			ondisk_from_cache_entry(ce, &ondisk);
+			if (ce_write(&crc, fd, &ondisk, ondisk_size) < 0)
+				return -1;
+			crc = htonl(crc);
+			if (ce_write(NULL, fd, &crc, 4) < 0)
+				return -1;
+			offset += 4;
+			ce = ce->next;
+		}
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_conflict(struct conflict_entry *conflict, int fd)
+{
+	struct conflict_entry *current;
+	struct conflict_part *current_part;
+	uint32_t crc;
+
+	current = conflict;
+	while (current) {
+		unsigned int to_write;
+
+		crc = 0;
+		if (ce_write(&crc, fd,
+		     (Bytef*)(current->name + current->pathlen),
+		     current->namelen - current->pathlen) < 0)
+			return -1;
+		if (ce_write(&crc, fd, (Bytef*)"\0", 1) < 0)
+			return -1;
+		to_write = htonl(current->nfileconflicts);
+		if (ce_write(&crc, fd, (Bytef*)&to_write, 4) < 0)
+			return -1;
+		current_part = current->entries;
+		while (current_part) {
+			struct ondisk_conflict_part ondisk;
+
+			conflict_to_ondisk(current_part, &ondisk);
+			if (ce_write(&crc, fd, (Bytef*)&ondisk, sizeof(struct ondisk_conflict_part)) < 0)
+				return -1;
+			current_part = current_part->next;
+		}
+		to_write = htonl(crc);
+		if (ce_write(NULL, fd, (Bytef*)&to_write, 4) < 0)
+			return -1;
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_conflicts(struct index_state *istate,
+			      struct directory_entry *de,
+			      int fd)
+{
+	struct directory_entry *current;
+
+	current = de;
+	while (current) {
+		if (current->de_ncr != 0) {
+			if (write_conflict(current->conflict, fd) < 0)
+				return -1;
+		}
+		current = current->next;
+	}
+	return 0;
+}
+
+static int write_index_v5(struct index_state *istate, int newfd)
+{
+	struct cache_version_header hdr;
+	struct cache_header hdr_v5;
+	struct cache_entry **cache = istate->cache;
+	struct directory_entry *de;
+	struct ondisk_directory_entry *ondisk;
+	int entries = istate->cache_nr;
+	int i, removed, non_conflicted, total_dir_len, ondisk_directory_size;
+	int total_file_len, conflict_offset, offset_to_offset;
+	unsigned int ndir;
+	uint32_t crc;
+
+	if (istate->partially_read)
+		die("BUG: index: cannot write a partially read index");
+
+	for (i = removed = 0; i < entries; i++) {
+		if (cache[i]->ce_flags & CE_REMOVE)
+			removed++;
+	}
+	hdr.hdr_signature = htonl(CACHE_SIGNATURE);
+	hdr.hdr_version = htonl(istate->version);
+	hdr_v5.hdr_nfile = htonl(entries - removed);
+	hdr_v5.hdr_nextension = htonl(0); /* Currently no extensions are supported */
+
+	non_conflicted = 0;
+	total_dir_len = 0;
+	total_file_len = 0;
+	de = compile_directory_data(istate, entries, &ndir, &non_conflicted,
+			&total_dir_len, &total_file_len);
+	hdr_v5.hdr_ndir = htonl(ndir);
+
+	/*
+	 * This is needed because the compiler aligns structs to sizes that
+	 * are multiples of 4
+	 */
+	ondisk_directory_size = sizeof(ondisk->flags)
+		+ sizeof(ondisk->foffset)
+		+ sizeof(ondisk->cr)
+		+ sizeof(ondisk->ncr)
+		+ sizeof(ondisk->nsubtrees)
+		+ sizeof(ondisk->nfiles)
+		+ sizeof(ondisk->nentries)
+		+ sizeof(ondisk->sha1);
+	hdr_v5.hdr_fblockoffset = htonl(sizeof(hdr) + sizeof(hdr_v5) + 4
+		+ (ndir + 1) * 4
+		+ total_dir_len
+		+ ndir * (ondisk_directory_size + 4)
+		+ (non_conflicted + 1) * 4);
+
+	crc = 0;
+	if (ce_write(&crc, newfd, &hdr, sizeof(hdr)) < 0)
+		return -1;
+	if (ce_write(&crc, newfd, &hdr_v5, sizeof(hdr_v5)) < 0)
+		return -1;
+	crc = htonl(crc);
+	if (ce_write(NULL, newfd, &crc, 4) < 0)
+		return -1;
+
+	conflict_offset = sizeof(hdr) + sizeof(hdr_v5) + 4
+		+ (ndir + 1) * 4
+		+ total_dir_len
+		+ ndir * (ondisk_directory_size + 4)
+		+ (non_conflicted + 1) * 4
+		+ total_file_len
+		+ non_conflicted * (sizeof(struct ondisk_cache_entry) + 4);
+	if (write_directories(de, newfd, conflict_offset) < 0)
+		return -1;
+	offset_to_offset = sizeof(hdr) + sizeof(hdr_v5) + 4
+		+ (ndir + 1) * 4
+		+ total_dir_len
+		+ ndir * (ondisk_directory_size + 4);
+	if (write_entries(istate, de, entries, newfd, offset_to_offset) < 0)
+		return -1;
+	if (write_conflicts(istate, de, newfd) < 0)
+		return -1;
+	return ce_flush(newfd);
+}
+
 struct index_ops v5_ops = {
 	match_stat_basic,
 	verify_hdr,
 	read_index_v5,
-	NULL,
+	write_index_v5,
 	index_change_filter_opts_v5
 };
diff --git a/read-cache.c b/read-cache.c
index 5ec0222..33f5ba5 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -108,7 +108,7 @@ int match_stat_data(const struct stat_data *sd, struct stat *st)
 	return changed;
 }
 
-static uint32_t calculate_stat_crc(struct cache_entry *ce)
+uint32_t calculate_stat_crc(struct cache_entry *ce)
 {
 	unsigned int ctimens = 0;
 	uint32_t stat, stat_crc;
@@ -232,6 +232,8 @@ static void set_istate_ops(struct index_state *istate)
 
 	if (istate->version >= 2 && istate->version <= 4)
 		istate->ops = &v2_ops;
+	if (istate->version == 5)
+		istate->ops = &v5_ops;
 }
 
 int ce_match_stat_basic(struct index_state *istate,
@@ -1311,7 +1313,12 @@ static int verify_hdr_version(struct index_state *istate,
 	hdr_version = ntohl(hdr->hdr_version);
 	if (hdr_version < INDEX_FORMAT_LB || INDEX_FORMAT_UB < hdr_version)
 		return error("bad index version %d", hdr_version);
-	istate->ops = &v2_ops;
+
+	if (hdr_version >= 2 && hdr_version <= 4)
+		istate->ops = &v2_ops;
+	else if (hdr_version == 5)
+		istate->ops = &v5_ops;
+
 	return 0;
 }
 
diff --git a/read-cache.h b/read-cache.h
index fe53c8e..f392152 100644
--- a/read-cache.h
+++ b/read-cache.h
@@ -66,3 +66,4 @@ extern int ce_match_stat_basic(struct index_state *istate, const struct cache_en
 			       struct stat *st);
 extern int is_racy_timestamp(const struct index_state *istate, const struct cache_entry *ce);
 extern void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce);
+extern uint32_t calculate_stat_crc(struct cache_entry *ce);
-- 
1.8.3.453.g1dfc63d
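Both write_directories() and write_entries() in the patch above make two
passes: they first emit an offset table with one extra trailing offset, so
that the length of entry i can later be computed as offset[i + 1] - offset[i]
instead of calling strlen, and only then emit the records themselves. A toy
in-memory version of the offset-table step (record sizes invented):

```c
#include <assert.h>
#include <stdint.h>

#define NREC 3
static const uint32_t reclen[NREC] = { 5, 9, 2 };	/* invented record sizes */

/* Fill out[] with NREC offsets plus one sentinel offset pointing past the
 * last record, so a record's length is out[i + 1] - out[i]. */
static int build_offsets(uint32_t out[NREC + 1])
{
	uint32_t off = 0;
	int i;

	for (i = 0; i < NREC; i++) {
		out[i] = off;
		off += reclen[i];
	}
	out[NREC] = off;
	return NREC + 1;
}
```

The sentinel is exactly the "one more offset" the comments in the patch refer
to.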

^ permalink raw reply	[flat|nested] 51+ messages in thread
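The hdr_fblockoffset and conflict_offset values in write_index_v5() are pure
arithmetic over the counts gathered by compile_directory_data(), so the layout
can be sanity-checked in isolation. The concrete sizes fed in below (8-byte
version header, 16-byte v5 header, 46-byte packed directory entry) are
illustrative assumptions, not sizeof() on the real git structs:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the offset arithmetic in write_index_v5(); all sizes are
 * parameters so no real struct layouts are assumed. */
static uint32_t fblockoffset(uint32_t hdr_size, uint32_t hdr_v5_size,
			     uint32_t ondisk_dir_size, uint32_t ndir,
			     uint32_t total_dir_len, uint32_t non_conflicted)
{
	return hdr_size + hdr_v5_size + 4	/* headers + header crc */
		+ (ndir + 1) * 4		/* directory offset table */
		+ total_dir_len			/* directory pathnames */
		+ ndir * (ondisk_dir_size + 4)	/* dir entries + per-entry crc */
		+ (non_conflicted + 1) * 4;	/* file offset table */
}
```

For an index with only the root directory entry (ndir = 1, total_dir_len = 1,
no files) this yields 28 + 8 + 1 + 50 + 4 = 91 bytes with the assumed sizes.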

* [PATCH 19/22] read-cache: write index-v5 cache-tree data
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (17 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 18/22] read-cache: write index-v5 Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 20/22] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Write the cache-tree data for the index version 5 file format. The
in-memory cache-tree data is converted to the on-disk format by adding
it to the directory entries that were compiled from the cache entries
in the previous step.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache-v5.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)
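convert_one_to_ondisk_v5() in this patch locates each cache-tree node's
directory entry by the CRC-32 of its full path, accumulated one component per
recursion level with a "/" hashed in before each descent. The call pattern can
be sketched with a toy rolling hash standing in for zlib's crc32() (an
assumption made purely so the example is self-contained):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy rolling hash standing in for zlib's crc32(); only the call pattern
 * matters for this sketch, not the hash itself. */
static uint32_t toy_crc(uint32_t crc, const char *buf, int len)
{
	int i;

	for (i = 0; i < len; i++)
		crc = (crc << 5) + crc + (unsigned char)buf[i];
	return crc;
}

/* Hash one path component per recursion level and append "/" before
 * descending, mirroring how convert_one_to_ondisk_v5() threads its crc. */
static uint32_t path_crc(uint32_t crc, const char *path)
{
	const char *slash = strchr(path, '/');
	int len = slash ? (int)(slash - path) : (int)strlen(path);

	crc = toy_crc(crc, path, len);
	if (!slash)
		return crc;
	crc = toy_crc(crc, "/", 1);
	return path_crc(crc, slash + 1);
}
```

Hashing component by component gives the same result as hashing the whole
"a/b" string at once, which is what makes the recursive lookup line up with
the writer's per-directory crc.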

diff --git a/read-cache-v5.c b/read-cache-v5.c
index f056f6b..306de30 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -960,6 +960,57 @@ static struct conflict_entry *create_conflict_entry_from_ce(struct cache_entry *
 	return create_new_conflict(ce->name, ce_namelen(ce), pathlen);
 }
 
+static void convert_one_to_ondisk_v5(struct hash_table *table, struct cache_tree *it,
+				const char *path, int pathlen, uint32_t crc)
+{
+	int i;
+	struct directory_entry *found, *search;
+
+	crc = crc32(crc, (Bytef*)path, pathlen);
+	found = lookup_hash(crc, table);
+	search = found;
+	while (search && strcmp(path, search->pathname + search->de_pathlen - strlen(path)) != 0)
+		search = search->next_hash;
+	if (!search)
+		return;
+	/*
+	 * The number of subtrees is already calculated by
+	 * compile_directory_data, therefore we only need to
+	 * add the entry_count
+	 */
+	search->de_nentries = it->entry_count;
+	if (0 <= it->entry_count)
+		hashcpy(search->sha1, it->sha1);
+	if (strcmp(path, "") != 0)
+		crc = crc32(crc, (Bytef*)"/", 1);
+
+#if DEBUG
+	if (0 <= it->entry_count)
+		fprintf(stderr, "cache-tree <%.*s> (%d ent, %d subtree) %s\n",
+			pathlen, path, it->entry_count, it->subtree_nr,
+			sha1_to_hex(it->sha1));
+	else
+		fprintf(stderr, "cache-tree <%.*s> (%d subtree) invalid\n",
+			pathlen, path, it->subtree_nr);
+#endif
+
+	for (i = 0; i < it->subtree_nr; i++) {
+		struct cache_tree_sub *down = it->down[i];
+		if (i) {
+			struct cache_tree_sub *prev = it->down[i-1];
+			if (subtree_name_cmp(down->name, down->namelen,
+					     prev->name, prev->namelen) <= 0)
+				die("fatal - unsorted cache subtree");
+		}
+		convert_one_to_ondisk_v5(table, down->cache_tree, down->name, down->namelen, crc);
+	}
+}
+
+static void cache_tree_to_ondisk_v5(struct hash_table *table, struct cache_tree *root)
+{
+	convert_one_to_ondisk_v5(table, root, "", 0, 0);
+}
+
 static struct directory_entry *compile_directory_data(struct index_state *istate,
 						int nfile,
 						unsigned int *ndir,
@@ -1065,6 +1116,8 @@ static struct directory_entry *compile_directory_data(struct index_state *istate
 			previous_entry->next = no_subtrees;
 		}
 	}
+	if (istate->cache_tree)
+		cache_tree_to_ondisk_v5(&table, istate->cache_tree);
 	return de;
 }
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 20/22] read-cache: write resolve-undo data for index-v5
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (18 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 19/22] read-cache: write index-v5 cache-tree data Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:11 ` [PATCH 21/22] update-index.c: rewrite index when index-version is given Thomas Gummerer
  2013-07-07  8:12 ` [PATCH 22/22] p0003-index.sh: add perf test for the index formats Thomas Gummerer
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Make git write the resolve-undo data to the index file.

Since the resolve-undo data is joined with the conflicts in
the on-disk format of the index file version 5, conflicts and
resolve-undo data are written at the same time; the in-memory
resolve-undo data is converted to the on-disk format beforehand.

Helped-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 read-cache-v5.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
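The while (!found) loop in resolve_undo_to_ondisk_v5() walks up the path with
super_directory(), creating a directory entry for every missing parent until
it reaches one that is already in the hash table (the root entry with the
empty name always is). A sketch of that walk, with a toy lookup in place of
lookup_hash() and an invented known-directory list:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy "already in the hash table" lookup standing in for lookup_hash();
 * the known-directory list is invented.  "" is the root entry. */
static const char *known[] = { "", "a", "a/b" };

static int dir_known(const char *dir)
{
	size_t i;

	for (i = 0; i < sizeof(known) / sizeof(*known); i++)
		if (!strcmp(known[i], dir))
			return 1;
	return 0;
}

/* Like super_directory(), but returns "" (the root) for top-level paths
 * so the walk always terminates.  Caller frees the result. */
static char *parent_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	char *dir;

	if (!slash) {
		dir = malloc(1);
		dir[0] = '\0';
		return dir;
	}
	dir = malloc(slash - path + 1);
	memcpy(dir, path, slash - path);
	dir[slash - path] = '\0';
	return dir;
}

/* Count the directory entries the while (!found) loop would create. */
static int missing_parents(const char *path)
{
	char *dir = parent_of(path);
	int created = 0;

	while (!dir_known(dir)) {
		char *up = parent_of(dir);

		free(dir);
		dir = up;
		created++;
	}
	free(dir);
	return created;
}
```

The root entry acting as a guaranteed stopping point is why the loop in the
patch never needs an explicit depth check.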

diff --git a/read-cache-v5.c b/read-cache-v5.c
index 306de30..412db53 100644
--- a/read-cache-v5.c
+++ b/read-cache-v5.c
@@ -1011,6 +1011,99 @@ static void cache_tree_to_ondisk_v5(struct hash_table *table, struct cache_tree
 	convert_one_to_ondisk_v5(table, root, "", 0, 0);
 }
 
+static void resolve_undo_to_ondisk_v5(struct hash_table *table,
+				      struct string_list *resolve_undo,
+				      unsigned int *ndir, int *total_dir_len,
+				      struct directory_entry *de)
+{
+	struct string_list_item *item;
+	struct directory_entry *search;
+
+	if (!resolve_undo)
+		return;
+	for_each_string_list_item(item, resolve_undo) {
+		struct conflict_entry *conflict_entry;
+		struct resolve_undo_info *ui = item->util;
+		char *super;
+		int i, dir_len, len;
+		uint32_t crc;
+		struct directory_entry *found, *current, *new_tree;
+
+		if (!ui)
+			continue;
+
+		super = super_directory(item->string);
+		if (!super)
+			dir_len = 0;
+		else
+			dir_len = strlen(super);
+		crc = crc32(0, (Bytef*)super, dir_len);
+		found = lookup_hash(crc, table);
+		current = NULL;
+		new_tree = NULL;
+
+		while (!found) {
+			struct directory_entry *new;
+
+			new = init_directory_entry(super, dir_len);
+			if (!current)
+				current = new;
+			insert_directory_entry(new, table, total_dir_len, ndir, crc);
+			if (new_tree != NULL)
+				new->de_nsubtrees = 1;
+			new->next = new_tree;
+			new_tree = new;
+			super = super_directory(super);
+			if (!super)
+				dir_len = 0;
+			else
+				dir_len = strlen(super);
+			crc = crc32(0, (Bytef*)super, dir_len);
+			found = lookup_hash(crc, table);
+		}
+		search = found;
+		while (search->next_hash && strcmp(super, search->pathname) != 0)
+			search = search->next_hash;
+		if (search && !current)
+			current = search;
+		if (!search && !current)
+			current = new_tree;
+		if (!super && new_tree) {
+			new_tree->next = de->next;
+			de->next = new_tree;
+			de->de_nsubtrees++;
+		} else if (new_tree) {
+			struct directory_entry *temp;
+
+			search = de->next;
+			while (strcmp(super, search->pathname))
+				search = search->next;
+			temp = new_tree;
+			while (temp->next)
+				temp = temp->next;
+			search->de_nsubtrees++;
+			temp->next = search->next;
+			search->next = new_tree;
+		}
+
+		len = strlen(item->string);
+		conflict_entry = create_new_conflict(item->string, len, current->de_pathlen);
+		add_conflict_to_directory_entry(current, conflict_entry);
+		for (i = 0; i < 3; i++) {
+			if (ui->mode[i]) {
+				struct conflict_part *cp;
+
+				cp = xmalloc(sizeof(struct conflict_part));
+				cp->flags = (i + 1) << CONFLICT_STAGESHIFT;
+				cp->entry_mode = ui->mode[i];
+				cp->next = NULL;
+				hashcpy(cp->sha1, ui->sha1[i]);
+				add_part_to_conflict_entry(current, conflict_entry, cp);
+			}
+		}
+	}
+}
+
 static struct directory_entry *compile_directory_data(struct index_state *istate,
 						int nfile,
 						unsigned int *ndir,
@@ -1118,6 +1211,7 @@ static struct directory_entry *compile_directory_data(struct index_state *istate
 	}
 	if (istate->cache_tree)
 		cache_tree_to_ondisk_v5(&table, istate->cache_tree);
+	resolve_undo_to_ondisk_v5(&table, istate->resolve_undo, ndir, total_dir_len, de);
 	return de;
 }
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 21/22] update-index.c: rewrite index when index-version is given
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (19 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 20/22] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
@ 2013-07-07  8:11 ` Thomas Gummerer
  2013-07-07  8:12 ` [PATCH 22/22] p0003-index.sh: add perf test for the index formats Thomas Gummerer
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:11 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

Make update-index always rewrite the index when an index version
is given, even if the index already has the right version.  This
is useful for performance-testing both the writer and the reader.

Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 builtin/update-index.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 03f6426..7954ddb 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -6,6 +6,7 @@
 #include "cache.h"
 #include "quote.h"
 #include "cache-tree.h"
+#include "read-cache.h"
 #include "tree-walk.h"
 #include "builtin.h"
 #include "refs.h"
@@ -863,8 +864,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 			    preferred_index_format,
 			    INDEX_FORMAT_LB, INDEX_FORMAT_UB);
 
-		if (the_index.version != preferred_index_format)
-			active_cache_changed = 1;
+		active_cache_changed = 1;
 		the_index.version = preferred_index_format;
 	}
 
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 22/22] p0003-index.sh: add perf test for the index formats
  2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
                   ` (20 preceding siblings ...)
  2013-07-07  8:11 ` [PATCH 21/22] update-index.c: rewrite index when index-version is given Thomas Gummerer
@ 2013-07-07  8:12 ` Thomas Gummerer
  21 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-07  8:12 UTC (permalink / raw)
  To: git; +Cc: trast, mhagger, gitster, pclouds, robin.rosenberg, t.gummerer

From: Thomas Rast <trast@inf.ethz.ch>

Add a performance test for index versions [23], 4 and 5 by using
git update-index --index-version=x, thus testing both the reader
and the writer speed of all index formats.

Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---
 t/perf/p0003-index.sh | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100755 t/perf/p0003-index.sh

diff --git a/t/perf/p0003-index.sh b/t/perf/p0003-index.sh
new file mode 100755
index 0000000..3e02868
--- /dev/null
+++ b/t/perf/p0003-index.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description="Tests index versions [23]/4/5"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+test_expect_success "convert to v2" "
+	git update-index --index-version=2
+"
+
+test_perf "v[23]: update-index" "
+	git update-index --index-version=2 >/dev/null
+"
+
+subdir=$(git ls-files | sed 's#/[^/]*$##' | grep -v '^$' | uniq | tail -n 30 | head -1)
+
+test_perf "v[23]: grep nonexistent -- subdir" "
+	test_must_fail git grep nonexistent -- $subdir >/dev/null
+"
+
+test_perf "v[23]: ls-files -- subdir" "
+	git ls-files $subdir >/dev/null
+"
+
+test_expect_success "convert to v4" "
+	git update-index --index-version=4
+"
+
+test_perf "v4: update-index" "
+	git update-index --index-version=4 >/dev/null
+"
+
+test_perf "v4: grep nonexistent -- subdir" "
+	test_must_fail git grep nonexistent -- $subdir >/dev/null
+"
+
+test_perf "v4: ls-files -- subdir" "
+	git ls-files $subdir >/dev/null
+"
+
+test_expect_success "convert to v5" "
+	git update-index --index-version=5
+"
+
+test_perf "v5: update-index" "
+	git update-index --index-version=5 >/dev/null
+"
+
+test_perf "v5: grep nonexistent -- subdir" "
+	test_must_fail git grep nonexistent -- $subdir >/dev/null
+"
+
+test_perf "v5: ls-files -- subdir" "
+	git ls-files $subdir >/dev/null
+"
+
+test_done
-- 
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 15/22] read-cache: read index-v5
  2013-07-07  8:11 ` [PATCH 15/22] read-cache: read index-v5 Thomas Gummerer
@ 2013-07-07 20:18   ` Eric Sunshine
  2013-07-08 11:40     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Eric Sunshine @ 2013-07-07 20:18 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, robin.rosenberg

On Sun, Jul 7, 2013 at 4:11 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Make git read the index file version 5 without complaining.
>
> This version of the reader reads neither the cache-tree
> nor the resolve undo data, but doesn't choke on an index that
> includes such data.
> ---
> diff --git a/read-cache-v5.c b/read-cache-v5.c
> new file mode 100644
> index 0000000..e319f30
> --- /dev/null
> +++ b/read-cache-v5.c
> @@ -0,0 +1,658 @@
> +static struct directory_entry *read_directories(unsigned int *dir_offset,
> +                               unsigned int *dir_table_offset,
> +                               void *mmap,
> +                               int mmap_size)
> +{
> +       int i, ondisk_directory_size;
> +       uint32_t *filecrc, *beginning, *end;
> +       struct directory_entry *current = NULL;
> +       struct ondisk_directory_entry *disk_de;
> +       struct directory_entry *de;
> +       unsigned int data_len, len;
> +       char *name;
> +
> +       /* Length of pathname + nul byte for termination + size of
> +        * members of ondisk_directory_entry. (Just using the size
> +        * of the stuct doesn't work, because there may be padding

s/stuct/struct/

> +        * bytes for the struct)
> +        */

Also:

  /*
   * Format multi-line comment
   * like this.
   */

Remaining multi-line comments appear to be formatted correctly.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 17/22] read-cache: read cache-tree in index-v5
  2013-07-07  8:11 ` [PATCH 17/22] read-cache: read cache-tree in index-v5 Thomas Gummerer
@ 2013-07-07 20:41   ` Eric Sunshine
  0 siblings, 0 replies; 51+ messages in thread
From: Eric Sunshine @ 2013-07-07 20:41 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, robin.rosenberg

On Sun, Jul 7, 2013 at 4:11 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Since the cache-tree data is saved as part of the directory data,
> we already read it at the beginning of the index. The cache-tree
> is only converted from this directory data.
>
> The cache-tree data is arranged in a tree, with the children sorted by
> pathlen at each node, while the ondisk format is sorted lexically.
> So we have to rebuild this format from the on-disk directory list.
>
> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
> ---
> diff --git a/read-cache-v5.c b/read-cache-v5.c
> index 193970a..f1ad132 100644
> --- a/read-cache-v5.c
> +++ b/read-cache-v5.c
> @@ -448,6 +448,103 @@ static int read_conflicts(struct conflict_entry **head,
>         return 0;
>  }
>
> +/*
> + * This function modifys the directory argument that is given to it.

s/modifys/modifies/

> + * Don't use it if the directory entries are still needed after.
> + */
> +static struct cache_tree *cache_tree_convert_v5(struct directory_entry *de)
> +{
> +       struct directory_queue *queue;
> +
> +       if (!de->de_nentries)
> +               return NULL;
> +       queue = xcalloc(1, sizeof(struct directory_queue));
> +       queue[0].de = de;
> +       queue[0].down = xcalloc(de->de_nsubtrees, sizeof(struct directory_queue));
> +
> +       sort_directories(de, queue[0].down);
> +       return convert_one(queue, 0);
> +}
> +
>  static void resolve_undo_convert_v5(struct index_state *istate,
>                                     struct conflict_entry *conflict)
>  {

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 18/22] read-cache: write index-v5
  2013-07-07  8:11 ` [PATCH 18/22] read-cache: write index-v5 Thomas Gummerer
@ 2013-07-07 20:43   ` Eric Sunshine
  0 siblings, 0 replies; 51+ messages in thread
From: Eric Sunshine @ 2013-07-07 20:43 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Robin Rosenberg

On Sun, Jul 7, 2013 at 4:11 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Write the index version 5 file format to disk. This version doesn't
> write the cache-tree data and resolve-undo data to the file.
>
> The main work is done when filtering out the directories from the
> current in-memory format; in the same pass the conflicts and the
> file data are also computed.
>
> Helped-by: Nguyen Thai Ngoc Duy <pclouds@gmail.com>
> Helped-by: Thomas Rast <trast@student.ethz.ch>
> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
> ---
> diff --git a/read-cache-v5.c b/read-cache-v5.c
> index f1ad132..f056f6b 100644
> --- a/read-cache-v5.c
> +++ b/read-cache-v5.c
> +static int write_index_v5(struct index_state *istate, int newfd)
> +{
> +       struct cache_version_header hdr;
> +       struct cache_header hdr_v5;
> +       struct cache_entry **cache = istate->cache;
> +       struct directory_entry *de;
> +       struct ondisk_directory_entry *ondisk;
> +       int entries = istate->cache_nr;
> +       int i, removed, non_conflicted, total_dir_len, ondisk_directory_size;
> +       int total_file_len, conflict_offset, offset_to_offset;
> +       unsigned int ndir;
> +       uint32_t crc;
> +
> +       if (istate->partially_read)
> +               die("BUG: index: cannot write a partially read index");
> +
> +       for (i = removed = 0; i < entries; i++) {
> +               if (cache[i]->ce_flags & CE_REMOVE)
> +                       removed++;
> +       }
> +       hdr.hdr_signature = htonl(CACHE_SIGNATURE);
> +       hdr.hdr_version = htonl(istate->version);
> +       hdr_v5.hdr_nfile = htonl(entries - removed);
> +       hdr_v5.hdr_nextension = htonl(0); /* Currently no extensions are supported */
> +
> +       non_conflicted = 0;
> +       total_dir_len = 0;
> +       total_file_len = 0;
> +       de = compile_directory_data(istate, entries, &ndir, &non_conflicted,
> +                       &total_dir_len, &total_file_len);
> +       hdr_v5.hdr_ndir = htonl(ndir);
> +
> +       /*
> +        * This is needed because the compiler aligns structs to sizes multipe

s/multipe/multiple/

> +        * of 4
> +        */
> +       ondisk_directory_size = sizeof(ondisk->flags)
> +               + sizeof(ondisk->foffset)
> +               + sizeof(ondisk->cr)
> +               + sizeof(ondisk->ncr)
> +               + sizeof(ondisk->nsubtrees)
> +               + sizeof(ondisk->nfiles)
> +               + sizeof(ondisk->nentries)
> +               + sizeof(ondisk->sha1);
> +       hdr_v5.hdr_fblockoffset = htonl(sizeof(hdr) + sizeof(hdr_v5) + 4
> +               + (ndir + 1) * 4
> +               + total_dir_len
> +               + ndir * (ondisk_directory_size + 4)
> +               + (non_conflicted + 1) * 4);
> +
> +       crc = 0;
> +       if (ce_write(&crc, newfd, &hdr, sizeof(hdr)) < 0)
> +               return -1;
> +       if (ce_write(&crc, newfd, &hdr_v5, sizeof(hdr_v5)) < 0)
> +               return -1;
> +       crc = htonl(crc);
> +       if (ce_write(NULL, newfd, &crc, 4) < 0)
> +               return -1;
> +
> +       conflict_offset = sizeof(hdr) + sizeof(hdr_v5) + 4
> +               + (ndir + 1) * 4
> +               + total_dir_len
> +               + ndir * (ondisk_directory_size + 4)
> +               + (non_conflicted + 1) * 4
> +               + total_file_len
> +               + non_conflicted * (sizeof(struct ondisk_cache_entry) + 4);
> +       if (write_directories(de, newfd, conflict_offset) < 0)
> +               return -1;
> +       offset_to_offset = sizeof(hdr) + sizeof(hdr_v5) + 4
> +               + (ndir + 1) * 4
> +               + total_dir_len
> +               + ndir * (ondisk_directory_size + 4);
> +       if (write_entries(istate, de, entries, newfd, offset_to_offset) < 0)
> +               return -1;
> +       if (write_conflicts(istate, de, newfd) < 0)
> +               return -1;
> +       return ce_flush(newfd);
> +}
> +

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
@ 2013-07-08  2:01   ` Duy Nguyen
  2013-07-08 11:40     ` Thomas Gummerer
  2013-07-08  2:19   ` Duy Nguyen
  2013-07-08 16:36   ` [PATCH 05/22] read-cache: add index reading api Junio C Hamano
  2 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-08  2:01 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Sun, Jul 7, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Add an api for access to the index file.  Currently there is only a very
> basic api for accessing the index file, which only allows a full read of
> the index, and lets the users of the data filter it.  The new index api
> gives the users the possibility to use only part of the index and
> provides functions for iterating over and accessing cache entries.
>
> This simplifies future improvements to the in-memory format, as changes
> will be concentrated on one file, instead of the whole git source code.
>
> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
> ---
>  cache.h         |  57 +++++++++++++++++++++++++++++-
>  read-cache-v2.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++--
>  read-cache.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  read-cache.h    |  12 ++++++-
>  4 files changed, 263 insertions(+), 10 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 5082b34..d38dfbd 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -127,7 +127,8 @@ struct cache_entry {
>         unsigned int ce_flags;
>         unsigned int ce_namelen;
>         unsigned char sha1[20];
> -       struct cache_entry *next;
> +       struct cache_entry *next; /* used by name_hash */
> +       struct cache_entry *next_ce; /* used to keep a list of cache entries */
>         char name[FLEX_ARRAY]; /* more */
>  };

From what I read, doing

    ce = start;
    while (ce) { do(something); ce = next_cache_entry(ce); }

is the same as

    i = start_index;
    while (i < active_nr) { ce = active_cache[i]; do(something); i++; }

What's the advantage of using the former over the latter? Do you plan
to eliminate the latter loop (by hiding "struct cache_entry **cache;"
from the public index_state structure)?
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
  2013-07-08  2:01   ` Duy Nguyen
@ 2013-07-08  2:19   ` Duy Nguyen
  2013-07-08 11:20     ` Thomas Gummerer
  2013-07-08 16:36   ` [PATCH 05/22] read-cache: add index reading api Junio C Hamano
  2 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-08  2:19 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Sun, Jul 7, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +/*
> + * Options by which the index should be filtered when read partially.
> + *
> + * pathspec: The pathspec which the index entries have to match
> + * seen: Used to return the seen parameter from match_pathspec()
> + * max_prefix, max_prefix_len: These variables are set to the longest
> + *     common prefix of the given pathspec and to its length,
> + *     respectively
> + *
> + * read_staged: used to indicate if the conflicted entries (entries
> + *     with a stage) should be included
> + * read_cache_tree: used to indicate if the cache-tree should be read
> + * read_resolve_undo: used to indicate if the resolve undo data should
> + *     be read
> + */
> +struct filter_opts {
> +       const char **pathspec;
> +       char *seen;
> +       char *max_prefix;
> +       int max_prefix_len;
> +
> +       int read_staged;
> +       int read_cache_tree;
> +       int read_resolve_undo;
> +};
> +
>  struct index_state {
>         struct cache_entry **cache;
>         unsigned int version;
> @@ -270,6 +297,8 @@ struct index_state {
>         struct hash_table name_hash;
>         struct hash_table dir_hash;
>         struct index_ops *ops;
> +       struct internal_ops *internal_ops;
> +       struct filter_opts *filter_opts;
>  };

...

> -/* remember to discard_cache() before reading a different cache! */
> -int read_index_from(struct index_state *istate, const char *path)
> +
> +int read_index_filtered_from(struct index_state *istate, const char *path,
> +                            struct filter_opts *opts)
>  {
>         int fd, err, i;
>         struct stat st_old, st_new;
> @@ -1337,7 +1425,7 @@ int read_index_from(struct index_state *istate, const char *path)
>                 if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
>                         err = 1;
>
> -               if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
> +               if (istate->ops->read_index(istate, mmap, mmap_size, opts) < 0)
>                         err = 1;
>                 istate->timestamp.sec = st_old.st_mtime;
>                 istate->timestamp.nsec = ST_MTIME_NSEC(st_old);
> @@ -1345,6 +1433,7 @@ int read_index_from(struct index_state *istate, const char *path)
>                         die_errno("cannot stat the open index");
>
>                 munmap(mmap, mmap_size);
> +               istate->filter_opts = opts;
>                 if (!index_changed(&st_old, &st_new) && !err)
>                         return istate->cache_nr;
>         }

Putting filter_opts in index_state feels like a bad design. Iterator
information should be separated from the iterated object, so that two
callers can walk through the same index without stepping on each other
(I'm not talking about multithreading, a caller may walk a bit, then
the other caller starts walking, then the former caller resumes
walking again in a call chain).
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08  2:19   ` Duy Nguyen
@ 2013-07-08 11:20     ` Thomas Gummerer
  2013-07-08 12:45       ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 11:20 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Sun, Jul 7, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +/*
>> + * Options by which the index should be filtered when read partially.
>> + *
>> + * pathspec: The pathspec which the index entries have to match
>> + * seen: Used to return the seen parameter from match_pathspec()
>> + * max_prefix, max_prefix_len: These variables are set to the longest
>> + *     common prefix, the length of the longest common prefix of the
>> + *     given pathspec
>> + *
>> + * read_staged: used to indicate if the conflicted entries (entries
>> + *     with a stage) should be included
>> + * read_cache_tree: used to indicate if the cache-tree should be read
>> + * read_resolve_undo: used to indicate if the resolve undo data should
>> + *     be read
>> + */
>> +struct filter_opts {
>> +       const char **pathspec;
>> +       char *seen;
>> +       char *max_prefix;
>> +       int max_prefix_len;
>> +
>> +       int read_staged;
>> +       int read_cache_tree;
>> +       int read_resolve_undo;
>> +};
>> +
>>  struct index_state {
>>         struct cache_entry **cache;
>>         unsigned int version;
>> @@ -270,6 +297,8 @@ struct index_state {
>>         struct hash_table name_hash;
>>         struct hash_table dir_hash;
>>         struct index_ops *ops;
>> +       struct internal_ops *internal_ops;
>> +       struct filter_opts *filter_opts;
>>  };
>
> ...
>
>> -/* remember to discard_cache() before reading a different cache! */
>> -int read_index_from(struct index_state *istate, const char *path)
>> +
>> +int read_index_filtered_from(struct index_state *istate, const char *path,
>> +                            struct filter_opts *opts)
>>  {
>>         int fd, err, i;
>>         struct stat st_old, st_new;
>> @@ -1337,7 +1425,7 @@ int read_index_from(struct index_state *istate, const char *path)
>>                 if (istate->ops->verify_hdr(mmap, mmap_size) < 0)
>>                         err = 1;
>>
>> -               if (istate->ops->read_index(istate, mmap, mmap_size) < 0)
>> +               if (istate->ops->read_index(istate, mmap, mmap_size, opts) < 0)
>>                         err = 1;
>>                 istate->timestamp.sec = st_old.st_mtime;
>>                 istate->timestamp.nsec = ST_MTIME_NSEC(st_old);
>> @@ -1345,6 +1433,7 @@ int read_index_from(struct index_state *istate, const char *path)
>>                         die_errno("cannot stat the open index");
>>
>>                 munmap(mmap, mmap_size);
>> +               istate->filter_opts = opts;
>>                 if (!index_changed(&st_old, &st_new) && !err)
>>                         return istate->cache_nr;
>>         }
>
> Putting filter_opts in index_state feels like a bad design. Iterator
> information should be separated from the iterated object, so that two
> callers can walk through the same index without stepping on each other
> (I'm not talking about multithreading, a caller may walk a bit, then
> the other caller starts walking, then the former caller resumes
> walking again in a call chain).

Yes, you're right.  We need the filter_opts to see what part of the
index has been loaded [1] and which part has been skipped, but it
shouldn't be used for filtering in the for_each_index_entry function.

I think there should be two versions of the for_each_index_entry
function then, where the for_each_index_entry function would behave the
same way as the for_each_index_entry_filtered function with the
filter_opts parameter set to NULL:
for_each_index_entry_filtered(struct index_state *, each_cache_entry_fn, void *cb_data, struct filter_opts *)
for_each_index_entry(struct index_state *, each_cache_entry_fn, void *cb_data)

Both of them then should call index_change_filter_opts to make sure all
the entries that are needed are loaded in the in-memory format.

Does that make sense?

[1] That is only important for the new index-v5 file format, which can
    be loaded partially.  The filter_opts could always be set to NULL,
    as the whole index is always loaded anyway.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08  2:01   ` Duy Nguyen
@ 2013-07-08 11:40     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 11:40 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Sun, Jul 7, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Add an api for access to the index file.  Currently there is only a very
>> basic api for accessing the index file, which only allows a full read of
>> the index, and lets the users of the data filter it.  The new index api
>> gives the users the possibility to use only part of the index and
>> provides functions for iterating over and accessing cache entries.
>>
>> This simplifies future improvements to the in-memory format, as changes
>> will be concentrated on one file, instead of the whole git source code.
>>
>> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
>> ---
>>  cache.h         |  57 +++++++++++++++++++++++++++++-
>>  read-cache-v2.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  read-cache.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  read-cache.h    |  12 ++++++-
>>  4 files changed, 263 insertions(+), 10 deletions(-)
>>
>> diff --git a/cache.h b/cache.h
>> index 5082b34..d38dfbd 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -127,7 +127,8 @@ struct cache_entry {
>>         unsigned int ce_flags;
>>         unsigned int ce_namelen;
>>         unsigned char sha1[20];
>> -       struct cache_entry *next;
>> +       struct cache_entry *next; /* used by name_hash */
>> +       struct cache_entry *next_ce; /* used to keep a list of cache entries */
>>         char name[FLEX_ARRAY]; /* more */
>>  };
>
> From what I read, doing
>
>     ce = start;
>     while (ce) { do(something); ce = next_cache_entry(ce); }
>
> is the same as
>
>     i = start_index;
>     while (i < active_nr) { ce = active_cache[i]; do(something); i++; }
>
> What's the advantage of using the former over the latter? Do you plan
> to eliminate the latter loop (by hiding "struct cache_entry **cache;"
> from the public index_state structure)?

Yes, I wanted to eliminate the latter loop, because it depends on the
in-memory format of the index.  By moving all direct accesses of
index_state->cache to an api it gets easier to change the in-memory
format.  I played a bit with a tree-based in-memory format [1], which
represents the on-disk format of index-v5 more closely, making
modifications and partial-loading simpler.

I've tried switching all those loops to api calls, but that would make
the api too bloated, as there are a lot of those loops.  I found it
more sensible to do it this way, leaving the loops how they are, while
making future changes to the in-memory format a lot simpler.

[1] https://github.com/tgummerer/git/blob/index-v5api/read-cache-v5.c#L17

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 15/22] read-cache: read index-v5
  2013-07-07 20:18   ` Eric Sunshine
@ 2013-07-08 11:40     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 11:40 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, robin.rosenberg

Eric Sunshine <sunshine@sunshineco.com> writes:

> On Sun, Jul 7, 2013 at 4:11 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Make git read the index file version 5 without complaining.
>>
>> This version of the reader reads neither the cache-tree
>> nor the resolve undo data, but doesn't choke on an index that
>> includes such data.
>> ---
>> diff --git a/read-cache-v5.c b/read-cache-v5.c
>> new file mode 100644
>> index 0000000..e319f30
>> --- /dev/null
>> +++ b/read-cache-v5.c
>> @@ -0,0 +1,658 @@
>> +static struct directory_entry *read_directories(unsigned int *dir_offset,
>> +                               unsigned int *dir_table_offset,
>> +                               void *mmap,
>> +                               int mmap_size)
>> +{
>> +       int i, ondisk_directory_size;
>> +       uint32_t *filecrc, *beginning, *end;
>> +       struct directory_entry *current = NULL;
>> +       struct ondisk_directory_entry *disk_de;
>> +       struct directory_entry *de;
>> +       unsigned int data_len, len;
>> +       char *name;
>> +
>> +       /* Length of pathname + nul byte for termination + size of
>> +        * members of ondisk_directory_entry. (Just using the size
>> +        * of the stuct doesn't work, because there may be padding
>
> s/stuct/struct/
>
>> +        * bytes for the struct)
>> +        */
>
> Also:
>
>   /*
>    * Format multi-line comment
>    * like this.
>    */
>
> Remaining multi-line comments appear to be formatted correctly.

Thanks for catching this and the other typos.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08 11:20     ` Thomas Gummerer
@ 2013-07-08 12:45       ` Duy Nguyen
  2013-07-08 13:37         ` Thomas Gummerer
  2013-07-08 20:54         ` [PATCH 5.5/22] Add documentation for the index api Thomas Gummerer
  0 siblings, 2 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-07-08 12:45 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Mon, Jul 8, 2013 at 6:20 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>> Putting filter_opts in index_state feels like a bad design. Iterator
>> information should be separated from the iterated object, so that two
>> callers can walk through the same index without stepping on each other
>> (I'm not talking about multithreading, a caller may walk a bit, then
>> the other caller starts walking, then the former caller resumes
>> walking again in a call chain).
>
> Yes, you're right.  We need the filter_opts to see what part of the
> index has been loaded [1] and which part has been skipped, but it
> shouldn't be used for filtering in the for_each_index_entry function.
>
> I think there should be two versions of the for_each_index_entry
> function then, where the for_each_index_entry function would behave the
> same way as the for_each_index_entry_filtered function with the
> filter_opts parameter set to NULL:
> for_each_index_entry_filtered(struct index_state *, each_cache_entry_fn, void *cb_data, struct filter_opts *)
> for_each_index_entry(struct index_state *, each_cache_entry_fn, void *cb_data)
>
> Both of them then should call index_change_filter_opts to make sure all
> the entries that are needed are loaded in the in-memory format.
>
> Does that make sense?

Hmm.. I was confused actually (documentation on the api would help
greatly). If you already filter at load time, I don't think you need
to match again. The caller asked for filter and it should know what's
in there so for_each_index_entry just goes through all entries without
extra match_pathspec. Or is that what next_index_entry is for?
match_pathspec function could be expensive when glob is involved. If
the caller wants extra matching, it could do inside the callback
function.

It seems you could change the filter with index_change_filter_opts. In
v5 the index will be reloaded. What happens when some index entries
are already modified? Do we start to have read-only index "views" and
one read-write view? If partial views are always read-only, perhaps we
just allow the user to create a new index_state (or view) with new
filter and destroy the old one. We don't have to care about changing
or separating filter in that case because the view is the iterator.

I wanted to have a tree-based iterator api, but that seems
incompatible with pre-v5 (or at least adds some overhead on pre-v5 to
rebuild the tree structure). It looks like using pathspec to build a
list of entries, as you did, is a good way to take advantage of
tree-based v5 while maintaining code compatibility with pre-v5. By the
way with tree structure, you could use tree_entry_interesting in
read_index_filtered_v5. I think it's more efficient than
match_pathspec.

I'm still studying the code. Some of what I wrote here may be totally
wrong due to my lack of understanding. I'll get back to you later if I
find something else.

> [1] That is only important for the new index-v5 file format, which can
>     be loaded partially.  The filter_opts could always be set to NULL,
>     as the whole index is always loaded anyway.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08 12:45       ` Duy Nguyen
@ 2013-07-08 13:37         ` Thomas Gummerer
  2013-07-08 20:54         ` [PATCH 5.5/22] Add documentation for the index api Thomas Gummerer
  1 sibling, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 13:37 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Mon, Jul 8, 2013 at 6:20 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Duy Nguyen <pclouds@gmail.com> writes:
>>> Putting filter_opts in index_state feels like a bad design. Iterator
>>> information should be separated from the iterated object, so that two
>>> callers can walk through the same index without stepping on each other
>>> (I'm not talking about multithreading, a caller may walk a bit, then
>>> the other caller starts walking, then the former caller resumes
>>> walking again in a call chain).
>>
>> Yes, you're right.  We need the filter_opts to see what part of the
>> index has been loaded [1] and which part has been skipped, but it
>> shouldn't be used for filtering in the for_each_index_entry function.
>>
>> I think there should be two versions of the for_each_index_entry
>> function then, where the for_each_index_entry function would behave the
>> same way as the for_each_index_entry_filtered function with the
>> filter_opts parameter set to NULL:
>> for_each_index_entry_filtered(struct index_state *, each_cache_entry_fn, void *cb_data, struct filter_opts *)
>> for_each_index_entry(struct index_state *, each_cache_entry_fn, void *cb_data)
>>
>> Both of them then should call index_change_filter_opts to make sure all
>> the entries that are needed are loaded in the in-memory format.
>>
>> Does that make sense?
>
> Hmm.. I was confused actually (documentation on the api would help
> greatly). If you already filter at load time, I don't think you need
> to match again. The caller asked for filter and it should know what's
> in there so for_each_index_entry just goes through all entries without
> extra match_pathspec. Or is that what next_index_entry for?
> match_pathspec function could be expensive when glob is involved. If
> the caller wants extra matching, it could do inside the callback
> function.

Yes, some documentation would be good.  I'll try to write something better
later today, when I have some more time.  In the meantime I'll just
outline what the functions do here shortly:

read_index_filtered(opts): This method behaves differently for index-v2
  and index-v5.
  For index-v2 it simply reads the whole index as read_cache() does, so
  we are sure we don't have to reload anything if the user wants a
  different filter.
  For index-v5 it creates an adjusted pathspec and reads all
  directories that are matched by it.

get_index_entry_by_name(name, namelen, &ce): Returns a cache_entry
  matched by name via the &ce parameter.  If a cache_entry is matched
  exactly, 1 is returned.
  Name may also be a path, in which case it returns 0 and the first
  cache_entry under that path. e.g. if we have:
      ...
      path/file1
      ...
    in the index and name is "path", then it returns 0 and the path/file1
    cache_entry.  If name is "path/file1" on the other hand, it returns 1
    and the path/file1 cache_entry.

for_each_index_entry(fn, cb_data):  Iterates over all cache_entries in
  the index filtered by filter_opts in the index_state, and executes fn
  for each of them with the cb_data as callback data.

next_index_entry(ce): Returns the cache_entry that follows ce.

index_change_filter_opts(opts): For index-v2 it simply changes the
  filter_opts, so for_each_index_entry uses the new filter_opts.
  For index-v5 it refreshes the index if the filter_opts have changed.
  This has some optimization potential: in the case that the opts get
  stricter (less of the index needs to be read), nothing would have to
  be reloaded.
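To make the iteration part concrete, here is a minimal standalone
sketch of the shape of the API.  The struct and function bodies are
simplified stand-ins for the real definitions in cache.h, not the
series' actual code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for the real struct cache_entry in cache.h;
 * only the fields relevant to iteration are shown.
 */
struct cache_entry {
	const char *name;
	struct cache_entry *next_ce;	/* list of loaded entries */
};

typedef int (*each_cache_entry_fn)(struct cache_entry *ce, void *cb_data);

/* Walk the loaded entries; fn returning non-zero breaks the loop. */
static int for_each_index_entry(struct cache_entry *first,
				each_cache_entry_fn fn, void *cb_data)
{
	struct cache_entry *ce;
	for (ce = first; ce; ce = ce->next_ce) {
		int ret = fn(ce, cb_data);
		if (ret)
			return ret;
	}
	return 0;
}

static int count_entry(struct cache_entry *ce, void *cb_data)
{
	(void)ce;
	++*(int *)cb_data;
	return 0;	/* 0 continues the iteration */
}
```

The point of the callback shape is that the caller never indexes into
the in-memory array directly, so the in-memory format can change later
without touching the callers.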

The caller can't know exactly what's in the cache: the whole index is
in the cache if the on-disk format is index-v2, while the index is
filtered by the adjusted_pathspec if the on-disk format is index-v5.
That's what I need the extra match_pathspec for.  But yes, that could
also be left to the caller.

Hope that makes it a little clearer.

> It seems you could change the filter with index_change_filter_opts. In
> v5 the index will be reloaded. What happens when some index entries
> are already modified? Do we start to have read-only index "views" and
> one read-write view? If partial views are always read-only, perhaps we
> just allow the user to create a new index_state (or view) with new
> filter and destroy the old one. We don't have to care about changing
> or separating filter in that case because the view is the iterator.

The read-write part is mostly covered by the next patch (6/22).  Before
changing the index, the filter_opts always have to be set to NULL using
index_change_filter_opts, so that the whole index is used.  This is
currently hard to improve, because we always need the whole index when
we write it.  Changing this only makes sense once we have partial
writing too.

So in principle the index_change_filter_opts function implements those
views.

Even with partial writing we have to distinguish whether a cache_entry
has been added/removed, in which case a full rewrite is necessary, or
whether a cache_entry has simply been modified (its content changed),
in which case we could replace it in place.
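A sketch of that write-strategy distinction (the names here are
illustrative, not from the series): adding or removing an entry shifts
everything after it on disk and forces a full rewrite, while a
content-only change could be replaced in place once partial writing
exists.

```c
#include <assert.h>

/* Illustrative classification of how an in-memory entry changed. */
enum ce_change {
	CE_UNCHANGED,
	CE_CONTENT_MODIFIED,	/* stat data or sha1 changed */
	CE_ADDED_OR_REMOVED	/* the entry list itself changed */
};

/* Only a change to the entry list forces rewriting the whole file. */
static int needs_full_rewrite(enum ce_change change)
{
	return change == CE_ADDED_OR_REMOVED;
}
```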

> I wanted to have a tree-based iterator api, but that seems
> incompatible with pre-v5 (or at least adds some overhead on pre-v5 to
> rebuild the tree structure). It looks like using pathspec to build a
> list of entries, as you did, is a good way to take advantage of
> tree-based v5 while maintaining code compatibility with pre-v5. By the
> way with tree structure, you could use tree_entry_interesting in
> read_index_filtered_v5. I think it's more efficient than
> match_pathspec.

Yes, that's why I decided to keep the current in-memory format for now.
Once an api is in place I think it will be easier to explore the
tree-based format, without having to change the format all over the
place.

Thanks, I will take a look at tree_entry_interesting later.

> I'm still studying the code. Some of what I wrote here may be totally
> wrong due to my lack of understanding. I'll get back to you later if I
> find something else.
>
>> [1] That is only important for the new index-v5 file format, which can
>>     be loaded partially.  The filter_opts could always be set to NULL,
>>     as the whole index is always loaded anyway.
> --
> Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/22] make sure partially read index is not changed
  2013-07-07  8:11 ` [PATCH 06/22] make sure partially read index is not changed Thomas Gummerer
@ 2013-07-08 16:31   ` Junio C Hamano
  2013-07-08 18:33     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2013-07-08 16:31 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, mhagger, pclouds, robin.rosenberg

Thomas Gummerer <t.gummerer@gmail.com> writes:

> A partially read index file currently cannot be written to disk.  Make
> sure that never happens, by re-reading the index file if the index file
> wasn't read completely before changing the in-memory index.

I am not quite sure what you are trying to do.  

In operations that modify the index (replace_index_entry(),
remove_index_entry_at(), etc.)  you lift the filter_opts and keep
partially_read flag still on.  In the write-out codepath, you have
an assert to make sure the caller has cleared the partially_read
flag.  A natural way to clear the flag is to re-read the index from
the file, but then you can easily lose the modifications.  Should
there be another safety that says "calling read_index() with the
partially_read flag on is a bug" or something?

Also shouldn't the flag be cleared upon discard_index()?  If it is
done there, you probably would not need to clear it in read_index().

>
> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
> ---
>  builtin/update-index.c | 4 ++++
>  cache.h                | 4 +++-
>  read-cache-v2.c        | 3 +++
>  read-cache.c           | 8 ++++++++
>  4 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 5c7762e..03f6426 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -49,6 +49,8 @@ static int mark_ce_flags(const char *path, int flag, int mark)
>  	int namelen = strlen(path);
>  	int pos = cache_name_pos(path, namelen);
>  	if (0 <= pos) {
> +		if (active_cache_partially_read)
> +			cache_change_filter_opts(NULL);
>  		if (mark)
>  			active_cache[pos]->ce_flags |= flag;
>  		else
> @@ -253,6 +255,8 @@ static void chmod_path(int flip, const char *path)
>  	pos = cache_name_pos(path, strlen(path));
>  	if (pos < 0)
>  		goto fail;
> +	if (active_cache_partially_read)
> +		cache_change_filter_opts(NULL);
>  	ce = active_cache[pos];
>  	mode = ce->ce_mode;
>  	if (!S_ISREG(mode))
> diff --git a/cache.h b/cache.h
> index d38dfbd..f6c3407 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -293,7 +293,8 @@ struct index_state {
>  	struct cache_tree *cache_tree;
>  	struct cache_time timestamp;
>  	unsigned name_hash_initialized : 1,
> -		 initialized : 1;
> +		 initialized : 1,
> +		 partially_read : 1;
>  	struct hash_table name_hash;
>  	struct hash_table dir_hash;
>  	struct index_ops *ops;
> @@ -315,6 +316,7 @@ extern void free_name_hash(struct index_state *istate);
>  #define active_alloc (the_index.cache_alloc)
>  #define active_cache_changed (the_index.cache_changed)
>  #define active_cache_tree (the_index.cache_tree)
> +#define active_cache_partially_read (the_index.partially_read)
>  
>  #define read_cache() read_index(&the_index)
>  #define read_cache_from(path) read_index_from(&the_index, (path))
> diff --git a/read-cache-v2.c b/read-cache-v2.c
> index 1ed640d..2cc792d 100644
> --- a/read-cache-v2.c
> +++ b/read-cache-v2.c
> @@ -273,6 +273,7 @@ static int read_index_v2(struct index_state *istate, void *mmap,
>  		src_offset += 8;
>  		src_offset += extsize;
>  	}
> +	istate->partially_read = 0;
>  	return 0;
>  unmap:
>  	munmap(mmap, mmap_size);
> @@ -495,6 +496,8 @@ static int write_index_v2(struct index_state *istate, int newfd)
>  	struct stat st;
>  	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
>  
> +	if (istate->partially_read)
> +		die("BUG: index: cannot write a partially read index");
>  	for (i = removed = extended = 0; i < entries; i++) {
>  		if (cache[i]->ce_flags & CE_REMOVE)
>  			removed++;
> diff --git a/read-cache.c b/read-cache.c
> index b30ee75..4529fab 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -30,6 +30,8 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
>  {
>  	struct cache_entry *old = istate->cache[nr];
>  
> +	if (istate->partially_read)
> +		index_change_filter_opts(istate, NULL);
>  	remove_name_hash(istate, old);
>  	set_index_entry(istate, nr, ce);
>  	istate->cache_changed = 1;
> @@ -467,6 +469,8 @@ int remove_index_entry_at(struct index_state *istate, int pos)
>  {
>  	struct cache_entry *ce = istate->cache[pos];
>  
> +	if (istate->partially_read)
> +		index_change_filter_opts(istate, NULL);
>  	record_resolve_undo(istate, ce);
>  	remove_name_hash(istate, ce);
>  	istate->cache_changed = 1;
> @@ -978,6 +982,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
>  {
>  	int pos;
>  
> +	if (istate->partially_read)
> +		index_change_filter_opts(istate, NULL);
>  	if (option & ADD_CACHE_JUST_APPEND)
>  		pos = istate->cache_nr;
>  	else {
> @@ -1184,6 +1190,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
>  				/* If we are doing --really-refresh that
>  				 * means the index is not valid anymore.
>  				 */
> +				if (istate->partially_read)
> +					index_change_filter_opts(istate, NULL);
>  				ce->ce_flags &= ~CE_VALID;
>  				istate->cache_changed = 1;
>  			}

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
  2013-07-08  2:01   ` Duy Nguyen
  2013-07-08  2:19   ` Duy Nguyen
@ 2013-07-08 16:36   ` Junio C Hamano
  2013-07-08 20:10     ` Thomas Gummerer
  2 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2013-07-08 16:36 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, mhagger, pclouds, robin.rosenberg

Thomas Gummerer <t.gummerer@gmail.com> writes:

> Add an api for access to the index file.  Currently there is only a very
> basic api for accessing the index file, which only allows a full read of
> the index, and lets the users of the data filter it.  The new index api
> gives the users the possibility to use only part of the index and
> provides functions for iterating over and accessing cache entries.
>
> This simplifies future improvements to the in-memory format, as changes
> will be concentrated on one file, instead of the whole git source code.
>
> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
> ---
>  cache.h         |  57 +++++++++++++++++++++++++++++-
>  read-cache-v2.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++--
>  read-cache.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  read-cache.h    |  12 ++++++-
>  4 files changed, 263 insertions(+), 10 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index 5082b34..d38dfbd 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -127,7 +127,8 @@ struct cache_entry {
>  	unsigned int ce_flags;
>  	unsigned int ce_namelen;
>  	unsigned char sha1[20];
> -	struct cache_entry *next;
> +	struct cache_entry *next; /* used by name_hash */
> +	struct cache_entry *next_ce; /* used to keep a list of cache entries */

The reader often needs to rewind the read-pointer partially while
walking the index (e.g. next_cache_entry() in unpack-trees.c and how
the o->cache_bottom position is used throughout the subsystem).  I
am not sure if this singly-linked list is a good way to go.

> +/*
> + * Options by which the index should be filtered when read partially.
> + *
> + * pathspec: The pathspec which the index entries have to match
> + * seen: Used to return the seen parameter from match_pathspec()
> + * max_prefix, max_prefix_len: These variables are set to the longest
> + *     common prefix, the length of the longest common prefix of the
> + *     given pathspec

These probably should use the "struct pathspec" abstraction, not just the
"array of raw strings", no?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/22] make sure partially read index is not changed
  2013-07-08 16:31   ` Junio C Hamano
@ 2013-07-08 18:33     ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 18:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, trast, mhagger, pclouds, robin.rosenberg

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> A partially read index file currently cannot be written to disk.  Make
>> sure that never happens, by re-reading the index file if the index file
>> wasn't read completely before changing the in-memory index.
>
> I am not quite sure what you are trying to do.
>
> In operations that modify the index (replace_index_entry(),
> remove_index_entry_at(), etc.)  you lift the filter_opts and keep
> partially_read flag still on.  In the write-out codepath, you have
> an assert to make sure the caller has cleared the partially_read
> flag.  A natural way to clear the flag is to re-read the index from
> the file, but then you can easily lose the modifications.
>
> Also shouldn't the flag be cleared upon discard_index()?  If it is
> done there, you probably would not need to clear it in read_index().

Hrm, maybe the code isn't quite clear enough here, or maybe the patch
should come directly before (16/22) read-cache: read index-v5 to be
clearer.

The flag is always set to 0 in read_index_v2, as the whole index is
always read and therefore it never needs to be reset.  With
read_index_v5 on the other hand the flag is set when the filter_opts are
different from NULL.

But thinking about it, the flag is actually not necessary at all.  The
filter_opts should simply be checked for NULL for the assert and they
should also be set to NULL on discard_index.  Will fix this in the next
version.  Thanks.

> Should
> there be another safety that says "calling read_index() with the
> partially_read flag on is a bug" or something?

I'm not sure.  I think it doesn't hurt, as we discard the index when
we change the index_ops.  At the moment I can't think of a case where
calling read_index() would be used while the index is partially read
without discarding the cache first.  I'll add it in the next version.

>>
>> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
>> ---
>>  builtin/update-index.c | 4 ++++
>>  cache.h                | 4 +++-
>>  read-cache-v2.c        | 3 +++
>>  read-cache.c           | 8 ++++++++
>>  4 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/builtin/update-index.c b/builtin/update-index.c
>> index 5c7762e..03f6426 100644
>> --- a/builtin/update-index.c
>> +++ b/builtin/update-index.c
>> @@ -49,6 +49,8 @@ static int mark_ce_flags(const char *path, int flag, int mark)
>>  	int namelen = strlen(path);
>>  	int pos = cache_name_pos(path, namelen);
>>  	if (0 <= pos) {
>> +		if (active_cache_partially_read)
>> +			cache_change_filter_opts(NULL);
>>  		if (mark)
>>  			active_cache[pos]->ce_flags |= flag;
>>  		else
>> @@ -253,6 +255,8 @@ static void chmod_path(int flip, const char *path)
>>  	pos = cache_name_pos(path, strlen(path));
>>  	if (pos < 0)
>>  		goto fail;
>> +	if (active_cache_partially_read)
>> +		cache_change_filter_opts(NULL);
>>  	ce = active_cache[pos];
>>  	mode = ce->ce_mode;
>>  	if (!S_ISREG(mode))
>> diff --git a/cache.h b/cache.h
>> index d38dfbd..f6c3407 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -293,7 +293,8 @@ struct index_state {
>>  	struct cache_tree *cache_tree;
>>  	struct cache_time timestamp;
>>  	unsigned name_hash_initialized : 1,
>> -		 initialized : 1;
>> +		 initialized : 1,
>> +		 partially_read : 1;
>>  	struct hash_table name_hash;
>>  	struct hash_table dir_hash;
>>  	struct index_ops *ops;
>> @@ -315,6 +316,7 @@ extern void free_name_hash(struct index_state *istate);
>>  #define active_alloc (the_index.cache_alloc)
>>  #define active_cache_changed (the_index.cache_changed)
>>  #define active_cache_tree (the_index.cache_tree)
>> +#define active_cache_partially_read (the_index.partially_read)
>>
>>  #define read_cache() read_index(&the_index)
>>  #define read_cache_from(path) read_index_from(&the_index, (path))
>> diff --git a/read-cache-v2.c b/read-cache-v2.c
>> index 1ed640d..2cc792d 100644
>> --- a/read-cache-v2.c
>> +++ b/read-cache-v2.c
>> @@ -273,6 +273,7 @@ static int read_index_v2(struct index_state *istate, void *mmap,
>>  		src_offset += 8;
>>  		src_offset += extsize;
>>  	}
>> +	istate->partially_read = 0;
>>  	return 0;
>>  unmap:
>>  	munmap(mmap, mmap_size);
>> @@ -495,6 +496,8 @@ static int write_index_v2(struct index_state *istate, int newfd)
>>  	struct stat st;
>>  	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
>>
>> +	if (istate->partially_read)
>> +		die("BUG: index: cannot write a partially read index");
>>  	for (i = removed = extended = 0; i < entries; i++) {
>>  		if (cache[i]->ce_flags & CE_REMOVE)
>>  			removed++;
>> diff --git a/read-cache.c b/read-cache.c
>> index b30ee75..4529fab 100644
>> --- a/read-cache.c
>> +++ b/read-cache.c
>> @@ -30,6 +30,8 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
>>  {
>>  	struct cache_entry *old = istate->cache[nr];
>>
>> +	if (istate->partially_read)
>> +		index_change_filter_opts(istate, NULL);
>>  	remove_name_hash(istate, old);
>>  	set_index_entry(istate, nr, ce);
>>  	istate->cache_changed = 1;
>> @@ -467,6 +469,8 @@ int remove_index_entry_at(struct index_state *istate, int pos)
>>  {
>>  	struct cache_entry *ce = istate->cache[pos];
>>
>> +	if (istate->partially_read)
>> +		index_change_filter_opts(istate, NULL);
>>  	record_resolve_undo(istate, ce);
>>  	remove_name_hash(istate, ce);
>>  	istate->cache_changed = 1;
>> @@ -978,6 +982,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
>>  {
>>  	int pos;
>>
>> +	if (istate->partially_read)
>> +		index_change_filter_opts(istate, NULL);
>>  	if (option & ADD_CACHE_JUST_APPEND)
>>  		pos = istate->cache_nr;
>>  	else {
>> @@ -1184,6 +1190,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
>>  				/* If we are doing --really-refresh that
>>  				 * means the index is not valid anymore.
>>  				 */
>> +				if (istate->partially_read)
>> +					index_change_filter_opts(istate, NULL);
>>  				ce->ce_flags &= ~CE_VALID;
>>  				istate->cache_changed = 1;
>>  			}

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08 16:36   ` [PATCH 05/22] read-cache: add index reading api Junio C Hamano
@ 2013-07-08 20:10     ` Thomas Gummerer
  2013-07-08 23:09       ` Junio C Hamano
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 20:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, trast, mhagger, pclouds, robin.rosenberg

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>> Add an api for access to the index file.  Currently there is only a very
>> basic api for accessing the index file, which only allows a full read of
>> the index, and lets the users of the data filter it.  The new index api
>> gives the users the possibility to use only part of the index and
>> provides functions for iterating over and accessing cache entries.
>>
>> This simplifies future improvements to the in-memory format, as changes
>> will be concentrated on one file, instead of the whole git source code.
>>
>> Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
>> ---
>>  cache.h         |  57 +++++++++++++++++++++++++++++-
>>  read-cache-v2.c |  96 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  read-cache.c    | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  read-cache.h    |  12 ++++++-
>>  4 files changed, 263 insertions(+), 10 deletions(-)
>>
>> diff --git a/cache.h b/cache.h
>> index 5082b34..d38dfbd 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -127,7 +127,8 @@ struct cache_entry {
>>  	unsigned int ce_flags;
>>  	unsigned int ce_namelen;
>>  	unsigned char sha1[20];
>> -	struct cache_entry *next;
>> +	struct cache_entry *next; /* used by name_hash */
>> +	struct cache_entry *next_ce; /* used to keep a list of cache entries */
>
> The reader often needs to rewind the read-pointer partially while
> walking the index (e.g. next_cache_entry() in unpack-trees.c and how
> the o->cache_bottom position is used throughout the subsystem).  I
> am not sure if this singly-linked list is a good way to go.

I'm not very familiar with the unpack-trees code, but from a quick look
the pointer (or position in the cache) only ever moves forward.  A
problem I do see though is skipping a number of entries at once.  An
example of that below:
			int matches;
			matches = cache_tree_matches_traversal(o->src_index->cache_tree,
							       names, info);
			/*
			 * Everything under the name matches; skip the
			 * entire hierarchy.  diff_index_cached codepath
			 * special cases D/F conflicts in such a way that
			 * it does not do any look-ahead, so this is safe.
			 */
			if (matches) {
				o->cache_bottom += matches;
				return mask;
			}

This could probably be transformed into something like
skip_cache_tree_matches(cache-tree, names, info);

I'll take some time to familiarize myself with the unpack-trees code to
see if I can find a better solution than this, and if there are more
pitfalls.
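To illustrate the trade-off in a toy model (the struct and function
below are illustrative, not the real unpack-trees code): with the
array-based index, skipping a matched hierarchy is plain arithmetic
(o->cache_bottom += matches), while on a singly-linked entry list the
equivalent skip has to walk every node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the real struct cache_entry. */
struct cache_entry {
	const char *name;
	struct cache_entry *next_ce;
};

/* Skip `matches` entries; O(matches) instead of the array's O(1). */
static struct cache_entry *skip_entries(struct cache_entry *ce, int matches)
{
	while (ce && matches-- > 0)
		ce = ce->next_ce;
	return ce;
}
```

For the cache-tree case above the walk cost may be acceptable, since
the skipped entries were just read sequentially anyway, but it is the
kind of hidden cost a list-based iterator introduces.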

>> +/*
>> + * Options by which the index should be filtered when read partially.
>> + *
>> + * pathspec: The pathspec which the index entries have to match
>> + * seen: Used to return the seen parameter from match_pathspec()
>> + * max_prefix, max_prefix_len: These variables are set to the longest
>> + *     common prefix, the length of the longest common prefix of the
>> + *     given pathspec
>
> These probably should use the "struct pathspec" abstraction, not just the
> "array of raw strings", no?

Yes, thanks, that's probably a good idea.
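Roughly, filter_opts would then carry the pathspec abstraction instead
of raw strings.  In the sketch below, struct pathspec is a minimal
stand-in (the real one also carries per-item magic and match flags),
and the field names follow the series but are only illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for Git's struct pathspec. */
struct pathspec {
	int nr;			/* number of pathspec items */
	const char **raw;	/* the original strings */
};

/* filter_opts built on the abstraction rather than raw strings. */
struct filter_opts {
	struct pathspec pathspec;
	char *seen;		/* filled in by match_pathspec() */
	char *max_prefix;	/* longest common prefix of the items */
	int max_prefix_len;
};
```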

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 5.5/22] Add documentation for the index api
  2013-07-08 12:45       ` Duy Nguyen
  2013-07-08 13:37         ` Thomas Gummerer
@ 2013-07-08 20:54         ` Thomas Gummerer
  2013-07-09 15:42           ` Duy Nguyen
  1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-08 20:54 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Document the new index api and add examples of how it should be used
instead of the old functions directly accessing the index.

Helped-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
---

Duy Nguyen <pclouds@gmail.com> writes:

> Hmm.. I was confused actually (documentation on the api would help
> greatly).

As promised, a draft for a documentation for the index api as it is in
this series.

Documentation/technical/api-in-core-index.txt | 108 +++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/api-in-core-index.txt b/Documentation/technical/api-in-core-index.txt
index adbdbf5..5269bb1 100644
--- a/Documentation/technical/api-in-core-index.txt
+++ b/Documentation/technical/api-in-core-index.txt
@@ -1,14 +1,116 @@
 in-core index API
 =================

+Reading API
+-----------
+
+`read_index()`::
+	Read the whole index file from disk.
+
+`index_name_pos(name, namelen)`::
+	Find a cache_entry with name in the index.  Returns pos if an
+	entry is matched exactly and -pos-1 if an entry is matched
+	partially.
+	e.g.
+	index:
+	file1
+	file2
+	path/file1
+	zzz
+
+	index_name_pos("path/file1", 10) returns 2, while
+	index_name_pos("path", 4) returns -3 (-pos-1, with pos == 2)
+
+`read_index_filtered(opts)`::
+	This method behaves differently for index-v2 and index-v5.
+
+	For index-v2 it simply reads the whole index as read_index()
+	does, so we are sure we don't have to reload anything if the
+	user wants a different filter.  It also sets the filter_opts
+	in the index_state, which is used to limit the results when
+	iterating over the index with for_each_index_entry().
+
+	The whole index is read to avoid the need to eventually
+	re-read the index later, because the performance is no
+	different when reading it partially.
+
+	For index-v5 it creates an adjusted_pathspec to filter the
+	reading.  First all the directory entries are read and then
+	the cache_entries in the directories that match the adjusted
+	pathspec are read.  The filter_opts in the index_state are set
+	to filter out the rest of the cache_entries that are matched
+	by the adjusted pathspec but not by the pathspec given.  The
+	rest of the index entries are filtered out when iterating over
+	the cache with for_each_index_entry().
+
+`get_index_entry_by_name(name, namelen, &ce)`::
+	Returns a cache_entry matched by the name, returned via the
+	&ce parameter.  If a cache entry is matched exactly, 1 is
+	returned, otherwise 0.  For an example see index_name_pos().
+	This function should be used instead of the index_name_pos()
+	function to retrieve cache entries.
+
+`for_each_index_entry(fn, cb_data)`::
+	Iterates over all cache_entries in the index filtered by
+	filter_opts in the index_state.  For each cache entry fn is
+	executed with cb_data as callback data.  From within the loop
+	do `return 0` to continue, or `return 1` to break the loop.
+
+`next_index_entry(ce)`::
+	Returns the cache_entry that follows after ce
+
+`index_change_filter_opts(opts)`::
+	This function again has a slightly different functionality for
+	index-v2 and index-v5.
+
+	For index-v2 it simply changes the filter_opts, so
+	for_each_index_entry uses the changed filter_opts, to iterate
+	over a different set of cache entries.
+
+	For index-v5 it refreshes the index if the filter_opts have
+	changed and sets the new filter_opts in the index state, again
+	to iterate over a different set of cache entries as with
+	index-v2.
+
+	This has some optimization potential, in the case that the
+	opts get stricter (less of the index should be read) it
+	doesn't have to reload anything, but currently does.
+
+Using the new index api
+-----------------------
+
+Currently, loops over a specific set of index entries are written as:
+  i = start_index;
+  while (i < active_nr) { ce = active_cache[i]; do(something); i++; }
+
+they should be rewritten to:
+  ce = start;
+  while (ce) { do(something); ce = next_cache_entry(ce); }
+
+which is the equivalent operation but hides the in-memory format of
+the index from the user.
+
+For getting a cache entry get_cache_entry_by_name() should be used
+instead of cache_name_pos(). e.g.:
+  int pos = cache_name_pos(name, namelen);
+  if (pos < 0) { do(something) }
+  else { struct cache_entry *ce = active_cache[pos];
+         do(somethingelse) }
+
+should be written as:
+  struct cache_entry *ce;
+  int ret = get_cache_entry_by_name(name, namelen, &ce);
+  if (!ret) { do(something) }
+  else { do(somethingelse) }
+
+TODO
+----
 Talk about <read-cache.c> and <cache-tree.c>, things like:

 * cache -> the_index macros
-* read_index()
 * write_index()
 * ie_match_stat() and ie_modified(); how they are different and when to
   use which.
-* index_name_pos()
 * remove_index_entry_at()
 * remove_file_from_index()
 * add_file_to_index()
@@ -18,4 +120,6 @@ Talk about <read-cache.c> and <cache-tree.c>, things like:
 * cache_tree_invalidate_path()
 * cache_tree_update()

+
+
 (JC, Linus)
--
1.8.3.453.g1dfc63d

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08 20:10     ` Thomas Gummerer
@ 2013-07-08 23:09       ` Junio C Hamano
  2013-07-09 20:13         ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Junio C Hamano @ 2013-07-08 23:09 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, mhagger, pclouds, robin.rosenberg

Thomas Gummerer <t.gummerer@gmail.com> writes:

>> The reader often needs to rewind the read-pointer partially while
>> walking the index (e.g. next_cache_entry() in unpack-trees.c and how
>> the o->cache_bottom position is used throughout the subsystem).  I
>> am not sure if this singly-linked list is a good way to go.
>
> I'm not very familiar with the unpack-trees code, but from a quick look
> the pointer (or position in the cache) is always only moved forward.

I am more worried about o->cache_bottom processing, where it
currently is an index into an array.

With your ce->next_in_list_of_read_entries change, a natural rewrite
would be to point at the ce with o->cache_bottom, but then that
would mean you cannot in-place replace the entries like we used to
be able to in an array based implementation.

But your series does not seem to touch unpack-trees yet, so I may be
worried too much before it becomes necessary.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5.5/22] Add documentation for the index api
  2013-07-08 20:54         ` [PATCH 5.5/22] Add documentation for the index api Thomas Gummerer
@ 2013-07-09 15:42           ` Duy Nguyen
  2013-07-09 20:10             ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-09 15:42 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Tue, Jul 9, 2013 at 3:54 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> As promised, a draft for a documentation for the index api as it is in
> this series.

First of all, it may be a good idea to acknowledge
index_state->cache[] as part of the API for now. Not hiding it
simplifies a few things (no need for the new next_ce field, no worries
about rewinding in unpack-trees..). Supporting partial loading (and
maybe partial updates in some cases) with this API, with
index_state->cache[] part of the API, is good enough for me. We can do
another tree-based API or similar update later once it has taken shape
(I looked at your index-v5api branch but I don't think a tree-based api
was there; my concern is how much extra overhead pre-v5 has to pay to
use a tree-based api).

> +`read_index_filtered(opts)`::
> +       This method behaves differently for index-v2 and index-v5.
> +
> +       For index-v2 it simply reads the whole index as read_index()
> +       does, so we are sure we don't have to reload anything if the
> +       user wants a different filter.  It also sets the filter_opts
> +       in the index_state, which is used to limit the results when
> +       iterating over the index with for_each_index_entry().
> +
> +       The whole index is read to avoid the need to eventually
> +       re-read the index later, because the performance is no
> +       different when reading it partially.
> +
> +       For index-v5 it creates an adjusted_pathspec to filter the
> +       reading.  First all the directory entries are read and then
> +       the cache_entries in the directories that match the adjusted
> +       pathspec are read.  The filter_opts in the index_state are set
> +       to filter out the rest of the cache_entries that are matched
> +       by the adjusted pathspec but not by the pathspec given.  The
> +       rest of the index entries are filtered out when iterating over
> +       the cache with for_each_index_entry().

You can state in the API that the input pathspec is used as a hint to
load only a portion of the index. read_index_filtered may load _more_
than necessary. It's the caller's responsibility to verify again which
is matched and which is not. That's how read_directory is done. I
think it gives you more liberty in loading strategy. It's already true
for v2 because the full index is loaded regardless of the given pathspec.
In the end, we have a linear list (from public view) of cache entries,
accessible via index_state->cache[].

If you happen to know that certain entries match the given pathspec,
you could help the caller avoid match_pathspec'ing again by setting a bit
in ce_flags.  To know which entry exists in the index and which is
new, use another flag. Most reader code won't change if we do it this
way, all match_pathspec() remain where they are.
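To make the idea concrete, here is a toy sketch of that caller-side check (the flag bit, struct layout and names are made up for illustration, not git's actual cache.h definitions):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bit: set by the loader when it already verified
 * that the entry matches the given pathspec. */
#define CE_MATCHED_SKETCH (1 << 14)

struct ce_sketch {
	unsigned int ce_flags;
	const char *name;
};

/* Stand-in for match_pathspec(): plain prefix matching only. */
static int prefix_matches(const char *name, const char *prefix)
{
	return !strncmp(name, prefix, strlen(prefix));
}

/* Caller-side check: trust the loader's bit, otherwise re-match. */
static int ce_sketch_matches(const struct ce_sketch *ce, const char *prefix)
{
	if (ce->ce_flags & CE_MATCHED_SKETCH)
		return 1;	/* already proved at load time */
	return prefix_matches(ce->name, prefix);
}
```

The point is that readers keep their match_pathspec() calls, they just get short-circuited when the loader already knows the answer.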

> +`for_each_index_entry(fn, cb_data)`::
> +       Iterates over all cache_entries in the index filtered by
> +       filter_opts in the index_state.  For each cache entry fn is
> +       executed with cb_data as callback data.  From within the loop
> +       do `return 0` to continue, or `return 1` to break the loop.

Because we don't attempt to hide index_state->cache[], this one may be
kept for convenience; the user is not required to convert to it.
Actually I think this may be slower because of the cost of calling a
function pointer.
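For comparison, the two iteration styles boil down to something like this (simplified, hypothetical shapes, not the real struct definitions):

```c
#include <assert.h>
#include <stddef.h>

struct ce_sketch { const char *name; };

struct index_sketch {
	struct ce_sketch **cache;
	unsigned int cache_nr;
};

typedef int (*each_entry_fn)(struct ce_sketch *ce, void *cb_data);

/* Convenience wrapper: fn returns non-zero to break out of the loop. */
static int for_each_entry_sketch(struct index_sketch *istate,
				 each_entry_fn fn, void *cb_data)
{
	unsigned int i;
	for (i = 0; i < istate->cache_nr; i++)
		if (fn(istate->cache[i], cb_data))
			return 1;
	return 0;
}

/* Example callback: counts the entries it is handed. */
static int count_entry(struct ce_sketch *ce, void *cb_data)
{
	(void)ce;
	(*(int *)cb_data)++;
	return 0;	/* continue iterating */
}
```

The direct `for (i = 0; i < istate->cache_nr; i++)` loop over cache[] does the same work without the indirect call per entry, which is where the small slowdown would come from.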

> +`next_index_entry(ce)`::
> +       Returns the cache_entry that follows after ce

next_ce field and this method may be gone too, just access index_state->cache[]

> +`index_change_filter_opts(opts)`::
> +       This function again has a slightly different functionality for
> +       index-v2 and index-v5.
> +
> +       For index-v2 it simply changes the filter_opts, so
> +       for_each_index_entry uses the changed filter_opts, to iterate
> +       over a different set of cache entries.
> +
> +       For index-v5 it refreshes the index if the filter_opts have
> +       changed and sets the new filter_opts in the index state, again
> +       to iterate over a different set of cache entries as with
> +       index-v2.
> +
> +       This has some optimization potential, in the case that the
> +       opts get stricter (less of the index should be read) it
> +       doesn't have to reload anything, but currently does.

The only use case I see so far is converting a partial index_state
back to a full one. Apart from doing so in order to write the new
index, I think some operation (like rename tracking in diff or
unpack-trees) may expect full index. I think we should support that. I
doubt we need to change pathspec to something different than the one
we used to load the index. When a user passes a pathspec to a command,
the user expects the command to operate on that set only, not outside.

If you take the input pathspec at loading just as a hint, you could
load all related directory blocks and all files in those blocks, so
that expanding to full index is simply adding more files from missing
directory blocks (and their files). An advantage of not strictly
following the input pathspec.

Some thoughts about the writing api.

I think we should avoid automatically converting a partial index into a
full one before writing. Push that back to the caller and die() when
asked to update partial index. They know at what point the index may
be updated and even what part of it may be updated. I think all
commands fall into two categories, tree-wide updates (merge,
checkout...) and limited by the user-given pathspec. "what part to be
updated" is not so hard to determine.

If the caller promises not to update or read outside certain pathspec
(part of API), we don't really need to load full index until
write_index is called. At that point I think we also know what
directory blocks are completely untouched by the caller (by promise)
and could copy them over from the old index byte-by-byte (or something
close, some offsets may be recalculated). That may keep the
tree-compiling cost proportional to the number of changed entries, not
the number of entries in the index.

There is another partial write case that's not covered this round (or
was it discussed and discarded?): refreshing the index. This operation
could be treated as a standalone, special one: refresh and update the
index file directly without waiting until write_index is called. There
are some commands that follow this scheme by doing
update_index_if_able() after refresh_index(). Those will be cheaper
with v5 because we don't need to write a full new index.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5.5/22] Add documentation for the index api
  2013-07-09 15:42           ` Duy Nguyen
@ 2013-07-09 20:10             ` Thomas Gummerer
  2013-07-10  5:28               ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-09 20:10 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Tue, Jul 9, 2013 at 3:54 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> As promised, a draft for a documentation for the index api as it is in
>> this series.
>
> First of all, it may be a good idea to acknowledge
> index_state->cache[] as part of the API for now. Not hiding it
> simplifies a few things (no need for new next_ce field, no worries
> about rewinding in unpack-trees..). Supporting partial loading (and
> maybe partial update in some cases) with this API and
> index_state->cache[] part of the API are good enough for me. We can do
> another tree-based API or something update later when it's formed (I
> looked at your index-v5api branch but I don't think a tree-based api
> was there, my concern is how much extra overhead pre-v5 has to pay to use
> tree-based api).

Yes, I think you're right, that simplifies everything a lot, while we
still can have partial loading.  Hiding index_state->cache[] was mainly
thought for future changes for the in-memory format, but I think that
will not be happening for a while.

>> +`read_index_filtered(opts)`::
>> +       This method behaves differently for index-v2 and index-v5.
>> +
>> +       For index-v2 it simply reads the whole index as read_index()
>> +       does, so we are sure we don't have to reload anything if the
>> +       user wants a different filter.  It also sets the filter_opts
>> +       in the index_state, which is used to limit the results when
>> +       iterating over the index with for_each_index_entry().
>> +
>> +       The whole index is read to avoid the need to eventually
>> +       re-read the index later, because the performance is no
>> +       different when reading it partially.
>> +
>> +       For index-v5 it creates an adjusted_pathspec to filter the
>> +       reading.  First all the directory entries are read and then
>> +       the cache_entries in the directories that match the adjusted
>> +       pathspec are read.  The filter_opts in the index_state are set
>> +       to filter out the rest of the cache_entries that are matched
>> +       by the adjusted pathspec but not by the pathspec given.  The
>> +       rest of the index entries are filtered out when iterating over
>> +       the cache with for_each_index_entry().
>
> You can state in the API that the input pathspec is used as a hint to
> load only a portion of the index. read_index_filtered may load _more_
> than necessary. It's the caller's responsibility to verify again which
> is matched and which is not. That's how read_directory is done. I
> think it gives you more liberty in loading strategy. It's already true
> for v2 because full index is loaded regardless of the given pathspec.
> In the end, we have a linear list (from public view) of cache entries,
> accessible via index_state->cache[].

Yes, and it's also partly true for index-v5, as the full content of a
directory is loaded even if only some files in it match the pathspec
that's given.

> If you happen to know that certain entries match the given pathspec,
> you could help the caller avoid match_pathspec'ing again by setting a bit
> in ce_flags.

I currently don't know which entries do match the pathspec from just
reading the index file, additional calls would be needed.  I don't think
that would be worth the overhead.

> To know which entry exists in the index and which is
> new, use another flag. Most reader code won't change if we do it this
> way, all match_pathspec() remain where they are.

Hrm you mean to know which cache entries are added (or changed) in the
in-memory index and will have to be written later?  I'm not sure I
understand correctly what you mean here.

>> +`for_each_index_entry(fn, cb_data)`::
>> +       Iterates over all cache_entries in the index filtered by
>> +       filter_opts in the index_state.  For each cache entry fn is
>> +       executed with cb_data as callback data.  From within the loop
>> +       do `return 0` to continue, or `return 1` to break the loop.
>
> Because we don't attempt to hide index_state->cache[], this one may be
> kept for convenience; the user is not required to convert to it.
> Actually I think this may be slower because of the cost of calling a
> function pointer.

Yes, I think you're right.  In fact I just tested it, and it's
slightly slower.

I still think it would make sense to keep it around, for the callers
that want the cache filtered exactly by the filter_opts, for convenience
as you said.

>> +`next_index_entry(ce)`::
>> +       Returns the cache_entry that follows after ce
>
> next_ce field and this method may be gone too, just access index_state->cache[]

Yes, this makes no sense when we're not hiding index_state->cache[].
The same goes for the get_index_entry_by_name function, which is
essentially the same as using index_name_pos and then getting the cache
entry from index_state->cache[].
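Roughly, that equivalence looks like this (a simplified stand-in for the binary search; names are hypothetical, but the return convention mirrors git's index_name_pos):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for index_name_pos(): binary search over the
 * sorted entry names; returns the position when found, and the
 * negated insertion point minus one (-pos-1) when not found. */
static int name_pos_sketch(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);
		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}
```

A get_index_entry_by_name-style lookup is then just `pos = name_pos_sketch(...)` followed by `pos < 0 ? NULL : istate->cache[pos]`, so the separate helper adds little.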

>> +`index_change_filter_opts(opts)`::
>> +       This function again has a slightly different functionality for
>> +       index-v2 and index-v5.
>> +
>> +       For index-v2 it simply changes the filter_opts, so
>> +       for_each_index_entry uses the changed filter_opts, to iterate
>> +       over a different set of cache entries.
>> +
>> +       For index-v5 it refreshes the index if the filter_opts have
>> +       changed and sets the new filter_opts in the index state, again
>> +       to iterate over a different set of cache entries as with
>> +       index-v2.
>> +
>> +       This has some optimization potential, in the case that the
>> +       opts get stricter (less of the index should be read) it
>> +       doesn't have to reload anything, but currently does.
>
> The only use case I see so far is converting a partial index_state
> back to a full one. Apart from doing so in order to write the new
> index, I think some operation (like rename tracking in diff or
> unpack-trees) may expect full index. I think we should support that. I
> doubt we need to change pathspec to something different than the one
> we used to load the index. When a user passes a pathspec to a command,
> the user expects the command to operate on that set only, not outside.

One application was in ls-files, where we strip the trailing slash from
the pathspecs for submodules.  But when we let the caller filter the
rest out it's not needed anymore.  We load all entries without the
trailing slash anyway.

> If you take the input pathspec at loading just as a hint, you could
> load all related directory blocks and all files in those blocks, so
> that expanding to full index is simply adding more files from missing
> directory blocks (and their files). An advantage of not strictly
> following the input pathspec.

I actually already do that with the adjusted pathspec.  Even with
index-v5 currently there are some more entries loaded than actually
matched by the pathspecs.  Expanding to the full index still takes some
work, but should be possible.

> Some thoughts about the writing api.
>
> I think we should avoid automatically converting a partial index into a
> full one before writing. Push that back to the caller and die() when
> asked to update partial index. They know at what point the index may
> be updated and even what part of it may be updated. I think all
> commands fall into two categories, tree-wide updates (merge,
> checkout...) and limited by the user-given pathspec. "what part to be
> updated" is not so hard to determine.

Hrm this is only true if index entries are added or removed, not if they
are only changed.  If they are only changed we can write a partially
read index once we have partial writing.  For now it would make sense to
just die() though, until we have that in place.

> If the caller promises not to update or read outside certain pathspec
> (part of API), we don't really need to load full index until
> write_index is called. At that point I think we also know what
> directory blocks are completely untouched by the caller (by promise)
> and could copy them over from the old index byte-by-byte (or something
> close, some offsets may be recalculated). That may keep the
> tree-compiling cost proportional to the number of changed entries, not
> the number of entries in the index.

Yes that would make sense.  I think that should go in a follow-up series
though as it would be quite some work.

> There is another partial write case that's not covered this round (or
> was it discussed and discarded?): refreshing the index. This operation
> could be treated as a standalone, special one: refresh and update the
> index file directly without waiting until write_index is called. There
> are some commands that follow this scheme by doing
> update_index_if_able() after refresh_index(). Those will be cheaper
> with v5 because we don't need to write a full new index.

I don't think it was discussed yet.  Partial writing will need a change
to the lock-file structure though, so I think it's a little more
complicated.

Thanks for your comments, I'll try to address them and send a new series
in a couple of days.


* Re: [PATCH 05/22] read-cache: add index reading api
  2013-07-08 23:09       ` Junio C Hamano
@ 2013-07-09 20:13         ` Thomas Gummerer
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-09 20:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, trast, mhagger, pclouds, robin.rosenberg

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Gummerer <t.gummerer@gmail.com> writes:
>
>>> The reader often needs to rewind the read-pointer partially while
>>> walking the index (e.g. next_cache_entry() in unpack-trees.c and how
>>> the o->cache_bottom position is used throughout the subsystem).  I
>>> am not sure if this singly-linked list is a good way to go.
>>
>> I'm not very familiar with the unpack-trees code, but from a quick look
>> the pointer (or position in the cache) is always only moved forward.
>
> I am more worried about o->cache_bottom processing, where it
> currently is an index into an array.
>
> With your ce->next_in_list_of_read_entries change, a natural rewrite
> would be to point at the ce with o->cache_bottom, but then that
> would mean you cannot in-place replace the entries like we used to
> be able to in an array based implementation.
>
> But your series does not seem to touch unpack-trees yet, so I may be
> worried too much before it becomes necessary.

Yes, you're right; as Duy mentioned in the other email I just responded
to, it makes sense to keep index_state->cache[] around for now.

I looked at the unpack-trees code a bit, and adding a new api and hiding
index_state->cache[] will probably be a bit harder to do than I
originally thought, so it's best to keep that around for now, as we're
still able to get the benefits from partial loading even if it's not
hidden.


* Re: [PATCH 5.5/22] Add documentation for the index api
  2013-07-09 20:10             ` Thomas Gummerer
@ 2013-07-10  5:28               ` Duy Nguyen
  2013-07-11 11:30                 ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-10  5:28 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Wed, Jul 10, 2013 at 3:10 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> If you happen to know that certain entries match the given pathspec,
>> you could help the caller avoid match_pathspec'ing again by setting a bit
>> in ce_flags.
>
> I currently don't know which entries do match the pathspec from just
> reading the index file, additional calls would be needed.  I don't think
> that would be worth the overhead.

Yeah I now see that you select what to load in v5 with the adjusted
pathspec, not the input pathspec. Originally I thought you match the
input pathspec against every file entry in the index :P Your adjusted
pathspec looks like what common_prefix is for. It's cheaper than
creating adjusted_pathspec from match_pathspec and reduces loading in
major cases, where glob is not used.

Still, creating an adjusted pathspec this way looks iffy. You need to
understand the pathspec in order to strip the filename part out and
match only at the directory level. An alternative is to use
tree_entry_interesting. It goes along well with tree traversal and can
be used to match directories with original pathspec. Once you see it
matches an entry in a directory, you could skip matching the rest of
the files and load the whole directory. read_index_filtered_v5 and
read_entries may need some tweaking though. I'll try it and post a
patch later if I succeed.
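For reference, the directory-prefix derivation we're discussing can be sketched roughly like this (a hypothetical helper; the real adjusted-pathspec code has to handle much more, e.g. pathspec magic):

```c
#include <assert.h>
#include <stddef.h>

/* Derive a directory-level "adjusted pathspec" prefix from an input
 * pathspec: keep everything up to the last '/' before the first
 * wildcard character, so whole directory blocks can be selected for
 * loading.  Illustrative only, not git's implementation. */
static size_t adjusted_prefix_len(const char *pathspec)
{
	size_t i, last_slash = 0;
	for (i = 0; pathspec[i]; i++) {
		if (pathspec[i] == '*' || pathspec[i] == '?' ||
		    pathspec[i] == '[')
			break;	/* glob starts here; stop widening */
		if (pathspec[i] == '/')
			last_slash = i + 1;
	}
	return last_slash;
}
```

So "Documentation/technical/*.txt" shrinks to the directory prefix "Documentation/technical/", and everything under that directory gets loaded even though only some files match the full pathspec.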

>> To know which entry exists in the index and which is
>> new, use another flag. Most reader code won't change if we do it this
>> way, all match_pathspec() remain where they are.
>
> Hrm you mean to know which cache entries are added (or changed) in the
> in-memory index and will have to be written later?  I'm not sure I
> understand correctly what you mean here.

Oh.. The "to know.." sentence was nonsense. We probably don't need to
know. We may track changed entries for partial writing, but let's
leave that out for now.

>>> +`index_change_filter_opts(opts)`::
>>> +       This function again has a slightly different functionality for
>>> +       index-v2 and index-v5.
>>> +
>>> +       For index-v2 it simply changes the filter_opts, so
>>> +       for_each_index_entry uses the changed filter_opts, to iterate
>>> +       over a different set of cache entries.
>>> +
>>> +       For index-v5 it refreshes the index if the filter_opts have
>>> +       changed and sets the new filter_opts in the index state, again
>>> +       to iterate over a different set of cache entries as with
>>> +       index-v2.
>>> +
>>> +       This has some optimization potential, in the case that the
>>> +       opts get stricter (less of the index should be read) it
>>> +       doesn't have to reload anything, but currently does.
>>
>> The only use case I see so far is converting a partial index_state
>> back to a full one. Apart from doing so in order to write the new
>> index, I think some operation (like rename tracking in diff or
>> unpack-trees) may expect full index. I think we should support that. I
>> doubt we need to change pathspec to something different than the one
>> we used to load the index. When a user passes a pathspec to a command,
>> the user expects the command to operate on that set only, not outside.
>
> One application was in ls-files, where we strip the trailing slash from
> the pathspecs for submodules.  But when we let the caller filter the
> rest out it's not needed anymore.  We load all entries without the
> trailing slash anyway.

That submodule trailing slash stripping code will be moved away soon
(I've been working on it for some time now). There's similar code in
pathspec.c. I hope by the time this series becomes a candidate for
'next', those pathspec manipulation is already gone. For
strip_trailing_slash_from_submodules, peeking in index file for a few
entries is probably ok. For check_path_for_gitlink, full index is
loaded until we figure out a clever way.

>> Some thoughts about the writing api.
>>
>> I think we should avoid automatically converting a partial index into a
>> full one before writing. Push that back to the caller and die() when
>> asked to update partial index. They know at what point the index may
>> be updated and even what part of it may be updated. I think all
>> commands fall into two categories, tree-wide updates (merge,
>> checkout...) and limited by the user-given pathspec. "what part to be
>> updated" is not so hard to determine.
>
> Hrm this is only true if index entries are added or removed, not if they
> are only changed.  If they are only changed we can write a partially
> read index once we have partial writing.

Yep. We can detect if changes are updates only, no additions nor
removals. If so do partial write, else full write. These little
details are hidden from the user, as long as they keep their promise
about read/write regions.

> For now it would make sense to just die() though, until we have that in place.

Agreed.
--
Duy


* Re: [PATCH 13/22] documentation: add documentation of the index-v5 file format
  2013-07-07  8:11 ` [PATCH 13/22] documentation: add documentation of the index-v5 file format Thomas Gummerer
@ 2013-07-11 10:39   ` Duy Nguyen
  2013-07-11 11:39     ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-11 10:39 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Sun, Jul 7, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> +== File entry (fileentries)
> +
> +  File entries are sorted in ascending order on the name field, after the
> +  respective offset given by the directory entries. All file names are
> +  prefix compressed, meaning the file name is relative to the directory.
> +
> +  filename (variable length, nul terminated). The exact encoding is
> +    undefined, but the filename cannot contain a NUL byte (iow, the same
> +    encoding as a UNIX pathname).
> +
> +  flags (16-bits): 'flags' field split into (high to low bits)
> +
> +    assumevalid (1-bit): assume-valid flag
> +
> +    intenttoadd (1-bit): intent-to-add flag, used by "git add -N".
> +      Extended flag in index v3.
> +
> +    stage (2-bit): stage of the file during merge
> +
> +    skipworktree (1-bit): skip-worktree flag, used by sparse checkout.
> +      Extended flag in index v3.
> +
> +    smudged (1-bit): indicates if the file is racily smudged.
> +
> +    10-bit unused, must be zero [6]
> +
> +  mode (16-bits): file mode, split into (high to low bits)
> +
> +    objtype (4-bits): object type
> +      valid values in binary are 1000 (regular file), 1010 (symbolic
> +      link) and 1110 (gitlink)
> +
> +    3-bit unused
> +
> +    permission (9-bits): unix permission. Only 0755 and 0644 are valid
> +      for regular files. Symbolic links and gitlinks have value 0 in
> +      this field.
> +
> +  mtimes (32-bits): mtime seconds, the last time a file's data changed
> +    this is stat(2) data
> +
> +  mtimens (32-bits): mtime nanosecond fractions
> +    this is stat(2) data
> +
> +  file size (32-bits): The on-disk size, truncated to 32-bit.
> +    this is stat(2) data
> +
> +  statcrc (32-bits): crc32 checksum over ctime seconds, ctime
> +    nanoseconds, ino, dev, uid, gid (All stat(2) data
> +    except mtime and file size). If the statcrc is 0 it will
> +    be ignored. [7]
> +
> +  objhash (160-bits): SHA-1 for the represented object
> +
> +  entrycrc (32-bits): crc32 checksum for the file entry. The crc code
> +    includes the offset to the file, relative to the
> +    beginning of the file.

Question about the possibility of updating index file directly. If git
updates a few fields of an entry (but not entrycrc yet) and crashes,
the entry would become corrupt because its entrycrc does not match the
content. What do we do? Do we need to save a copy of the entry
somewhere in the index file (maybe in the conflict data section), so
that the reader can recover the index? Losing the index because of
bugs is big deal in my opinion. pre-v5 never faces this because we
keep the original copy til the end.

Maybe entrycrc should not cover stat fields and statcrc. It would make
refreshing safer. If the above happens during refresh, only statcrc is
corrupt and we can just refresh the entry. entrycrc still says the
other fields are good (and they are).
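A rough sketch of computing such a stat-only checksum (the helper names and struct are made up; git would use zlib's crc32(), and a real format would serialize the fields explicitly instead of checksumming raw struct bytes, which depends on endianness):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal bitwise CRC-32 (polynomial 0xEDB88320), for illustration. */
static uint32_t crc32_sketch(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	while (len--) {
		int k;
		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical grouping of the fields statcrc covers (all stat(2)
 * data except mtime and file size, per the format description). */
struct statcrc_fields {
	uint32_t ctime_sec, ctime_nsec;
	uint32_t ino, dev, uid, gid;
};

static uint32_t compute_statcrc(const struct statcrc_fields *sf)
{
	return crc32_sketch(sf, sizeof(*sf));
}
```

With statcrc computed independently like this, a crash mid-refresh can corrupt only the stat checksum; an entrycrc that excludes the stat fields would still vouch for the rest of the entry.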
--
Duy


* Re: [PATCH 5.5/22] Add documentation for the index api
  2013-07-10  5:28               ` Duy Nguyen
@ 2013-07-11 11:30                 ` Thomas Gummerer
  2013-07-11 11:42                   ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-11 11:30 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Wed, Jul 10, 2013 at 3:10 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> If you happen to know that certain entries match the given pathspec,
>>> you could help the caller avoid match_pathspec'ing again by setting a bit
>>> in ce_flags.
>>
>> I currently don't know which entries do match the pathspec from just
>> reading the index file, additional calls would be needed.  I don't think
>> that would be worth the overhead.
>
> Yeah I now see that you select what to load in v5 with the adjusted
> pathspec, not the input pathspec. Originally I thought you match the
> input pathspec against every file entry in the index :P Your adjusted
> pathspec looks like what common_prefix is for. It's cheaper than
> creating adjusted_pathspec from match_pathspec and reduces loading in
> major cases, where glob is not used.
>
> Still, creating an adjusted pathspec this way looks iffy. You need to
> understand the pathspec in order to strip the filename part out and
> match only at the directory level. An alternative is to use
> tree_entry_interesting. It goes along well with tree traversal and can
> be used to match directories with original pathspec. Once you see it
> matches an entry in a directory, you could skip matching the rest of
> the files and load the whole directory. read_index_filtered_v5 and
> read_entries may need some tweaking though. I'll try it and post a
> patch later if I succeed.

Hrm, I played around a bit with this idea, but I couldn't figure out how
to make it work.  For it to work we would still have to load some
entries in a directory at least?  Or is there a way to match the
directories, which I just haven't figured out yet?

>>> To know which entry exists in the index and which is
>>> new, use another flag. Most reader code won't change if we do it this
>>> way, all match_pathspec() remain where they are.
>>
>> Hrm you mean to know which cache entries are added (or changed) in the
>> in-memory index and will have to be written later?  I'm not sure I
>> understand correctly what you mean here.
>
> Oh.. The "to know.." sentence was nonsense. We probably don't need to
> know. We may track changed entries for partial writing, but let's
> leave that out for now.

Ok, makes sense.

>>>> +`index_change_filter_opts(opts)`::
>>>> +       This function again has a slightly different functionality for
>>>> +       index-v2 and index-v5.
>>>> +
>>>> +       For index-v2 it simply changes the filter_opts, so
>>>> +       for_each_index_entry uses the changed filter_opts, to iterate
>>>> +       over a different set of cache entries.
>>>> +
>>>> +       For index-v5 it refreshes the index if the filter_opts have
>>>> +       changed and sets the new filter_opts in the index state, again
>>>> +       to iterate over a different set of cache entries as with
>>>> +       index-v2.
>>>> +
>>>> +       This has some optimization potential, in the case that the
>>>> +       opts get stricter (less of the index should be read) it
>>>> +       doesn't have to reload anything, but currently does.
>>>
>>> The only use case I see so far is converting a partial index_state
>>> back to a full one. Apart from doing so in order to write the new
>>> index, I think some operation (like rename tracking in diff or
>>> unpack-trees) may expect full index. I think we should support that. I
>>> doubt we need to change pathspec to something different than the one
>>> we used to load the index. When a user passes a pathspec to a command,
>>> the user expects the command to operate on that set only, not outside.
>>
>> One application was in ls-files, where we strip the trailing slash from
>> the pathspecs for submodules.  But when we let the caller filter the
>> rest out it's not needed anymore.  We load all entries without the
>> trailing slash anyway.
>
> That submodule trailing slash stripping code will be moved away soon
> (I've been working on it for some time now). There's similar code in
> pathspec.c. I hope by the time this series becomes a candidate for
> 'next', those pathspec manipulation is already gone. For
> strip_trailing_slash_from_submodules, peeking in index file for a few
> entries is probably ok. For check_path_for_gitlink, full index is
> loaded until we figure out a clever way.

Ah great, for now I'll just not use the for_each_index_entry function in
ls-files, and then change the code later once the stripping code is
moved away.

>>> Some thoughts about the writing api.
>>>
>>> I think we should avoid automatically converting a partial index into a
>>> full one before writing. Push that back to the caller and die() when
>>> asked to update partial index. They know at what point the index may
>>> be updated and even what part of it may be updated. I think all
>>> commands fall into two categories, tree-wide updates (merge,
>>> checkout...) and limited by the user-given pathspec. "what part to be
>>> updated" is not so hard to determine.
>>
>> Hrm this is only true if index entries are added or removed, not if they
>> are only changed.  If they are only changed we can write a partially
>> read index once we have partial writing.
>
> Yep. We can detect if changes are updates only, no additions nor
> removals. If so do partial write, else full write. These little
> details are hidden from the user, as long as they keep their promise
> about read/write regions.
>
>> For now it would make sense to just die() though, until we have that in place.
>
> Agreed.
> --
> Duy


* Re: [PATCH 13/22] documentation: add documentation of the index-v5 file format
  2013-07-11 10:39   ` Duy Nguyen
@ 2013-07-11 11:39     ` Thomas Gummerer
  2013-07-11 11:47       ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-11 11:39 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Sun, Jul 7, 2013 at 3:11 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> +== File entry (fileentries)
>> +
>> +  File entries are sorted in ascending order on the name field, after the
>> +  respective offset given by the directory entries. All file names are
>> +  prefix compressed, meaning the file name is relative to the directory.
>> +
>> +  filename (variable length, nul terminated). The exact encoding is
>> +    undefined, but the filename cannot contain a NUL byte (iow, the same
>> +    encoding as a UNIX pathname).
>> +
>> +  flags (16-bits): 'flags' field split into (high to low bits)
>> +
>> +    assumevalid (1-bit): assume-valid flag
>> +
>> +    intenttoadd (1-bit): intent-to-add flag, used by "git add -N".
>> +      Extended flag in index v3.
>> +
>> +    stage (2-bit): stage of the file during merge
>> +
>> +    skipworktree (1-bit): skip-worktree flag, used by sparse checkout.
>> +      Extended flag in index v3.
>> +
>> +    smudged (1-bit): indicates if the file is racily smudged.
>> +
>> +    10-bit unused, must be zero [6]
>> +
>> +  mode (16-bits): file mode, split into (high to low bits)
>> +
>> +    objtype (4-bits): object type
>> +      valid values in binary are 1000 (regular file), 1010 (symbolic
>> +      link) and 1110 (gitlink)
>> +
>> +    3-bit unused
>> +
>> +    permission (9-bits): unix permission. Only 0755 and 0644 are valid
>> +      for regular files. Symbolic links and gitlinks have value 0 in
>> +      this field.
>> +
>> +  mtimes (32-bits): mtime seconds, the last time a file's data changed
>> +    this is stat(2) data
>> +
>> +  mtimens (32-bits): mtime nanosecond fractions
>> +    this is stat(2) data
>> +
>> +  file size (32-bits): The on-disk size, truncated to 32-bit.
>> +    this is stat(2) data
>> +
>> +  statcrc (32-bits): crc32 checksum over ctime seconds, ctime
>> +    nanoseconds, ino, dev, uid, gid (All stat(2) data
>> +    except mtime and file size). If the statcrc is 0 it will
>> +    be ignored. [7]
>> +
>> +  objhash (160-bits): SHA-1 for the represented object
>> +
>> +  entrycrc (32-bits): crc32 checksum for the file entry. The crc code
>> +    includes the offset to the file entry, relative to the beginning
>> +    of the index file.
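The 16-bit flags layout quoted above can be decoded with a small sketch (illustrative only, not git's code; field order taken from the quoted document, high to low bits):

```python
def decode_flags(flags):
    """Decode the index-v5 16-bit 'flags' field, high to low bits:
    assumevalid(1), intenttoadd(1), stage(2), skipworktree(1),
    smudged(1), then 10 unused bits that must be zero."""
    return {
        "assumevalid":  (flags >> 15) & 0x1,
        "intenttoadd":  (flags >> 14) & 0x1,
        "stage":        (flags >> 12) & 0x3,
        "skipworktree": (flags >> 11) & 0x1,
        "smudged":      (flags >> 10) & 0x1,
        "unused":       flags & 0x3ff,
    }
```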
>
> Question about the possibility of updating index file directly. If git
> updates a few fields of an entry (but not entrycrc yet) and crashes,
> the entry would become corrupt because its entrycrc does not match the
> content. What do we do? Do we need to save a copy of the entry
> somewhere in the index file (maybe in the conflict data section), so
> that the reader can recover the index? Losing the index because of
> bugs is a big deal in my opinion. pre-v5 never faces this because we
> keep the original copy til the end.
>
> Maybe entrycrc should not cover stat fields and statcrc. It would make
> refreshing safer. If the above happens during refresh, only statcrc is
> corrupt and we can just refresh the entry. entrycrc still says the
> other fields are good (and they are).

The original idea was to change the lock-file for partial writing to
make it work for this case.  The exact structure of the file still has
to be defined, but generally it would be done in the following steps:

  1. Write the changed entry to the lock-file
  2. Change the entry in the index
  3. If we succeed delete the lock-file (commit the transaction)

If git crashes, and leaves the index corrupted, we can recover the
information from the lock-file and write the new information to the
index file and then delete the lock-file.
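The three-step transaction described above can be sketched as follows. The journal layout (an 8-byte offset followed by the raw entry bytes) is made up for this sketch; the mail explicitly leaves the exact lock-file structure undefined:

```python
import os

def update_entry(index_path, lock_path, offset, new_bytes):
    """Steps 1-3 from the mail: journal the change in the lock-file,
    patch the index in place, then delete the lock-file to commit."""
    with open(lock_path, "wb") as lock:          # 1. journal first
        lock.write(offset.to_bytes(8, "big") + new_bytes)
        lock.flush()
        os.fsync(lock.fileno())
    with open(index_path, "r+b") as idx:         # 2. patch in place
        idx.seek(offset)
        idx.write(new_bytes)
    os.unlink(lock_path)                         # 3. commit

def recover(index_path, lock_path):
    """If a crash left the lock-file behind, replay it: re-apply the
    journaled entry, then remove the lock-file."""
    if not os.path.exists(lock_path):
        return False
    with open(lock_path, "rb") as lock:
        data = lock.read()
    offset = int.from_bytes(data[:8], "big")
    with open(index_path, "r+b") as idx:
        idx.seek(offset)
        idx.write(data[8:])
    os.unlink(lock_path)
    return True
```

A crash between steps 2 and 3 leaves the lock-file behind, so the next reader can replay it and end up with a consistent index either way.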

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5.5/22] Add documentation for the index api
  2013-07-11 11:30                 ` Thomas Gummerer
@ 2013-07-11 11:42                   ` Duy Nguyen
  2013-07-11 12:27                     ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-11 11:42 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Thu, Jul 11, 2013 at 6:30 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
>> On Wed, Jul 10, 2013 at 3:10 AM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>>> If you happen to know that certain entries match the given pathspec,
>>>> you could help the caller avoid match_pathspec'ing again by set a bit
>>>> in ce_flags.
>>>
>>> I currently don't know which entries do match the pathspec from just
>>> reading the index file, additional calls would be needed.  I don't think
>>> that would be worth the overhead.
>>
>> Yeah I now see that you select what to load in v5 with the adjusted
>> pathspec, not the input pathspec. Originally I thought you match the
>> input pathspec against every file entry in the index :P Your adjusted
>> pathspec looks like what common_prefix is for. It's cheaper than
>> creating adjusted_pathspec from match_pathspec and reduces loading in
>> major cases, where glob is not used.
>>
>> Still, creating an adjusted pathspec this way looks iffy. You need to
>> understand pathspec in order to strip the filename part out and match
>> against directories only. An alternative is to use
>> tree_entry_interesting. It goes along well with tree traversal and can
>> be used to match directories with original pathspec. Once you see it
>> matches an entry in a directory, you could skip matching the rest of
>> the files and load the whole directory. read_index_filtered_v5 and
>> read_entries may need some tweaking though. I'll try it and post a
>> patch later if I succeed.
>
> Hrm, I played around a bit with this idea, but I couldn't figure out how
> to make it work.  For it to work we would still have to load some
> entries in a directory at least?  Or is there a way to match the
> directories, which I just haven't figured out yet?

Yes you have to load some entries first. Even if a directory does not
match, we only know that after seeing at least the first file in the
directory. OK
there might be problems because tree_entry_interesting expects all
entries in a directory to be memcmp sorted, without trailing slash for
subdirectories. I need to check again if v5 sort order is compatible..
--
Duy
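The memcmp order described above can be illustrated (the file names are made up; for ASCII names Python's default string sort is exactly the byte-wise order, and the subdirectory "abc" is compared by its bare name, with no trailing '/'):

```python
# Entries inside one directory, in the order tree_entry_interesting
# expects: byte-wise (memcmp) sorted, subdirectory "abc" without a
# trailing slash.  Note '-' (0x2d) sorts before '.' (0x2e).
entries = ["abc.c", "Makefile", "abc", "abcde", "abc-d"]
print(sorted(entries))
# ['Makefile', 'abc', 'abc-d', 'abc.c', 'abcde']
```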

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/22] documentation: add documentation of the index-v5 file format
  2013-07-11 11:39     ` Thomas Gummerer
@ 2013-07-11 11:47       ` Duy Nguyen
  2013-07-11 12:26         ` Thomas Gummerer
  0 siblings, 1 reply; 51+ messages in thread
From: Duy Nguyen @ 2013-07-11 11:47 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Thu, Jul 11, 2013 at 6:39 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Question about the possibility of updating index file directly. If git
>> updates a few fields of an entry (but not entrycrc yet) and crashes,
>> the entry would become corrupt because its entrycrc does not match the
>> content. What do we do? Do we need to save a copy of the entry
>> somewhere in the index file (maybe in the conflict data section), so
>> that the reader can recover the index? Losing the index because of
>> bugs is a big deal in my opinion. pre-v5 never faces this because we
>> keep the original copy til the end.
>>
>> Maybe entrycrc should not cover stat fields and statcrc. It would make
>> refreshing safer. If the above happens during refresh, only statcrc is
>> corrupt and we can just refresh the entry. entrycrc still says the
>> other fields are good (and they are).
>
> The original idea was to change the lock-file for partial writing to
> make it work for this case.  The exact structure of the file still has
> to be defined, but generally it would be done in the following steps:
>
>   1. Write the changed entry to the lock-file
>   2. Change the entry in the index
>   3. If we succeed delete the lock-file (commit the transaction)
>
> If git crashes, and leaves the index corrupted, we can recover the
> information from the lock-file and write the new information to the
> index file and then delete the lock-file.

Ah makes sense. Still concerned about refreshing though. Updated files
are usually few while refreshed files could be a lot more, increasing
the cost at #1.
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/22] documentation: add documentation of the index-v5 file format
  2013-07-11 11:47       ` Duy Nguyen
@ 2013-07-11 12:26         ` Thomas Gummerer
  2013-07-11 12:50           ` Duy Nguyen
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gummerer @ 2013-07-11 12:26 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

Duy Nguyen <pclouds@gmail.com> writes:

> On Thu, Jul 11, 2013 at 6:39 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>> Question about the possibility of updating index file directly. If git
>>> updates a few fields of an entry (but not entrycrc yet) and crashes,
>>> the entry would become corrupt because its entrycrc does not match the
>>> content. What do we do? Do we need to save a copy of the entry
>>> somewhere in the index file (maybe in the conflict data section), so
>>> that the reader can recover the index? Losing the index because of
>>> bugs is big deal in my opinion. pre-v5 never faces this because we
>>> keep the original copy til the end.
>>>
>>> Maybe entrycrc should not cover stat fields and statcrc. It would make
>>> refreshing safer. If the above happens during refresh, only statcrc is
>>> corrupt and we can just refresh the entry. entrycrc still says the
>>> other fields are good (and they are).
>>
>> The original idea was to change the lock-file for partial writing to
>> make it work for this case.  The exact structure of the file still has
>> to be defined, but generally it would be done in the following steps:
>>
>>   1. Write the changed entry to the lock-file
>>   2. Change the entry in the index
>>   3. If we succeed delete the lock-file (commit the transaction)
>>
>> If git crashes, and leaves the index corrupted, we can recover the
>> information from the lock-file and write the new information to the
>> index file and then delete the lock-file.
>
> Ah makes sense. Still concerned about refreshing though. Updated files
> are usually few while refreshed files could be a lot more, increasing
> the cost at #1.

Any idea how common refreshing a big part of the cache is?  If it's not
too common, I'd prefer to leave the stat data and stat crc in the
entrycrc, as we can inform the user if something is wrong with the
index, be it from git failing, or from disk corruption.

On the other hand if refresh_cache is relatively common and usually
changes a big part of the index we should leave them out, as git can
still run correctly with incorrect stat data, but takes a little longer,
because it may have to check the file contents.  That will be the
trade-off to make here.
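The statcrc field under discussion can be sketched with zlib's crc32 over the stat(2) data other than mtime and file size, as the quoted format document lists it. The packing order and 32-bit widths here are assumptions for illustration:

```python
import struct, zlib

def stat_crc(ctime_s, ctime_ns, ino, dev, uid, gid):
    """Sketch of the 'statcrc' field: crc32 over ctime seconds, ctime
    nanoseconds, ino, dev, uid and gid, packed as big-endian 32-bit
    words (packing is an assumption, not the on-disk definition)."""
    data = struct.pack(">6I", ctime_s, ctime_ns, ino, dev, uid, gid)
    return zlib.crc32(data) & 0xffffffff
```

Since crc32 detects any error burst of 32 bits or fewer, a change confined to one of these fields is guaranteed to change the checksum, which is what makes it usable as a cheap stat-validity check.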

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5.5/22] Add documentation for the index api
  2013-07-11 11:42                   ` Duy Nguyen
@ 2013-07-11 12:27                     ` Duy Nguyen
  0 siblings, 0 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-07-11 12:27 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Thu, Jul 11, 2013 at 6:42 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> On Thu, Jul 11, 2013 at 6:30 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>> Duy Nguyen <pclouds@gmail.com> writes:
>> Hrm, I played around a bit with this idea, but I couldn't figure out how
>> to make it work.  For it to work we would still have to load some
>> entries in a directory at least?  Or is there a way to match the
>> directories, which I just haven't figured out yet?
>
> Yes you have to load some entries first. Even if a directory does not
> match, we only know that after seeing at least the first file in the
> directory. OK
> there might be problems because tree_entry_interesting expects all
> entries in a directory to be memcmp sorted, without trailing slash for
> subdirectories. I need to check again if v5 sort order is compatible..

Not gonna work (at least not simply) because we have to mix
directories and files again. The way directory entries are ordered
makes it hard (or less efficient) to get the list of immediate subdirs
of a dir. I think I understand now why you need adjusted_pathspec..
--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 13/22] documentation: add documentation of the index-v5 file format
  2013-07-11 12:26         ` Thomas Gummerer
@ 2013-07-11 12:50           ` Duy Nguyen
  0 siblings, 0 replies; 51+ messages in thread
From: Duy Nguyen @ 2013-07-11 12:50 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Git Mailing List, Thomas Rast, Michael Haggerty, Junio C Hamano,
	Robin Rosenberg

On Thu, Jul 11, 2013 at 7:26 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
> Duy Nguyen <pclouds@gmail.com> writes:
>
>> On Thu, Jul 11, 2013 at 6:39 PM, Thomas Gummerer <t.gummerer@gmail.com> wrote:
>>>> Question about the possibility of updating index file directly. If git
>>>> updates a few fields of an entry (but not entrycrc yet) and crashes,
>>>> the entry would become corrupt because its entrycrc does not match the
>>>> content. What do we do? Do we need to save a copy of the entry
>>>> somewhere in the index file (maybe in the conflict data section), so
>>>> that the reader can recover the index? Losing the index because of
>>> bugs is a big deal in my opinion. pre-v5 never faces this because we
>>>> keep the original copy til the end.
>>>>
>>>> Maybe entrycrc should not cover stat fields and statcrc. It would make
>>>> refreshing safer. If the above happens during refresh, only statcrc is
>>>> corrupt and we can just refresh the entry. entrycrc still says the
>>>> other fields are good (and they are).
>>>
>>> The original idea was to change the lock-file for partial writing to
>>> make it work for this case.  The exact structure of the file still has
>>> to be defined, but generally it would be done in the following steps:
>>>
>>>   1. Write the changed entry to the lock-file
>>>   2. Change the entry in the index
>>>   3. If we succeed delete the lock-file (commit the transaction)
>>>
>>> If git crashes, and leaves the index corrupted, we can recover the
>>> information from the lock-file and write the new information to the
>>> index file and then delete the lock-file.
>>
>> Ah makes sense. Still concerned about refreshing though. Updated files
>> are usually few while refreshed files could be a lot more, increasing
>> the cost at #1.
>
> Any idea how common refreshing a big part of the cache is?

No, probably not common. Anyone who does "find|xargs touch" deserves
to be punished. Files can be edited, then reverted by an editor, but
there should not be many of those. The only sensible case is "git
checkout <path>" with lots of modified files. But that can't happen
often.

> If it's not too common, I'd prefer to leave the stat data and stat crc in the
> entrycrc, as we can inform the user if something is wrong with the
> index, be it from git failing, or from disk corruption.
>
> On the other hand if refresh_cache is relatively common and usually
> changes a big part of the index we should leave them out, as git can
> still run correctly with incorrect stat data, but takes a little longer,
> because it may have to check the file contents.  That will be the
> trade-off to make here.



--
Duy

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2013-07-11 12:51 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-07  8:11 [PATCH 00/22] Index v5 Thomas Gummerer
2013-07-07  8:11 ` [PATCH 01/22] t2104: Don't fail for index versions other than [23] Thomas Gummerer
2013-07-07  8:11 ` [PATCH 02/22] read-cache: split index file version specific functionality Thomas Gummerer
2013-07-07  8:11 ` [PATCH 03/22] read-cache: move index v2 specific functions to their own file Thomas Gummerer
2013-07-07  8:11 ` [PATCH 04/22] read-cache: Re-read index if index file changed Thomas Gummerer
2013-07-07  8:11 ` [PATCH 05/22] read-cache: add index reading api Thomas Gummerer
2013-07-08  2:01   ` Duy Nguyen
2013-07-08 11:40     ` Thomas Gummerer
2013-07-08  2:19   ` Duy Nguyen
2013-07-08 11:20     ` Thomas Gummerer
2013-07-08 12:45       ` Duy Nguyen
2013-07-08 13:37         ` Thomas Gummerer
2013-07-08 20:54         ` [PATCH 5.5/22] Add documentation for the index api Thomas Gummerer
2013-07-09 15:42           ` Duy Nguyen
2013-07-09 20:10             ` Thomas Gummerer
2013-07-10  5:28               ` Duy Nguyen
2013-07-11 11:30                 ` Thomas Gummerer
2013-07-11 11:42                   ` Duy Nguyen
2013-07-11 12:27                     ` Duy Nguyen
2013-07-08 16:36   ` [PATCH 05/22] read-cache: add index reading api Junio C Hamano
2013-07-08 20:10     ` Thomas Gummerer
2013-07-08 23:09       ` Junio C Hamano
2013-07-09 20:13         ` Thomas Gummerer
2013-07-07  8:11 ` [PATCH 06/22] make sure partially read index is not changed Thomas Gummerer
2013-07-08 16:31   ` Junio C Hamano
2013-07-08 18:33     ` Thomas Gummerer
2013-07-07  8:11 ` [PATCH 07/22] dir.c: use index api Thomas Gummerer
2013-07-07  8:11 ` [PATCH 08/22] tree.c: " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 09/22] name-hash.c: " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 10/22] grep.c: Use " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 11/22] ls-files.c: use the " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 12/22] read-cache: make read_blob_data_from_index use " Thomas Gummerer
2013-07-07  8:11 ` [PATCH 13/22] documentation: add documentation of the index-v5 file format Thomas Gummerer
2013-07-11 10:39   ` Duy Nguyen
2013-07-11 11:39     ` Thomas Gummerer
2013-07-11 11:47       ` Duy Nguyen
2013-07-11 12:26         ` Thomas Gummerer
2013-07-11 12:50           ` Duy Nguyen
2013-07-07  8:11 ` [PATCH 14/22] read-cache: make in-memory format aware of stat_crc Thomas Gummerer
2013-07-07  8:11 ` [PATCH 15/22] read-cache: read index-v5 Thomas Gummerer
2013-07-07 20:18   ` Eric Sunshine
2013-07-08 11:40     ` Thomas Gummerer
2013-07-07  8:11 ` [PATCH 16/22] read-cache: read resolve-undo data Thomas Gummerer
2013-07-07  8:11 ` [PATCH 17/22] read-cache: read cache-tree in index-v5 Thomas Gummerer
2013-07-07 20:41   ` Eric Sunshine
2013-07-07  8:11 ` [PATCH 18/22] read-cache: write index-v5 Thomas Gummerer
2013-07-07 20:43   ` Eric Sunshine
2013-07-07  8:11 ` [PATCH 19/22] read-cache: write index-v5 cache-tree data Thomas Gummerer
2013-07-07  8:11 ` [PATCH 20/22] read-cache: write resolve-undo data for index-v5 Thomas Gummerer
2013-07-07  8:11 ` [PATCH 21/22] update-index.c: rewrite index when index-version is given Thomas Gummerer
2013-07-07  8:12 ` [PATCH 22/22] p0003-index.sh: add perf test for the index formats Thomas Gummerer
