git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged
Date: Mon, 11 Jul 2016 20:15:32 +0200	[thread overview]
Message-ID: <20160711181532.20682-1-pclouds@gmail.com> (raw)

When a "tree unpacking" operation is needed, which is part of
switching branches using "git checkout", the following happens in a
sparse checkout:

1) Run all existing entries through $GIT_DIR/info/sparse-checkout,
   mark entries that are to-be-excluded or to-be-included.

2) Do n-way merge stuff to add, modify and delete entries.

3) Run all new entries added at step 2 through
   $GIT_DIR/info/sparse-checkout, mark entries that are to-be-excluded
   or to-be-included.

4) Compare the current excluded/include status with the to-be-status
   from steps 1 and 3, delete newly excluded entries from worktree and
   add newly included ones to worktree.

The "all existing entries" number in step 1 does not scale well. As
worktree gets bigger (or more sparse patterns are added), step 1 runs
slower. Which does not help because large worktrees are the reason
sparse-checkout is used, to keep the real worktree small again.

If we know that $GIT_DIR/info/sparse-checkout has not changed, we know
that running checking again would result in the exact same
included/excluded as recorded in the current index because
"sparse-checkout" is the only input to the exclude machinery. In this
case, marking the to-be-status is simply copying the current status
over, which is a lot faster.

The time breakdown of "git checkout" (no arguments) on webkit.git
(100k files) with a sparse checkout file of 4 negative patterns is
like this, where "sparse checkout loop #1" takes about 10% execution
time, which is the time saved after this patch.

read-cache.c:1661       performance: 0.057816104 s: read cache .git/index
files-backend.c:1097    performance: 0.000023980 s: read packed refs
preload-index.c:104     performance: 0.039178367 s: preload index
read-cache.c:1260       performance: 0.002700730 s: refresh index
name-hash.c:128         performance: 0.030409968 s: initialize name hash

unpack-trees.c:1173     performance: 0.100353572 s: sparse checkout loop #1

cache-tree.c:431        performance: 0.137213472 s: cache_tree_update
unpack-trees.c:1305     performance: 0.648923590 s: unpack_trees
read-cache.c:2139       performance: 0.074800165 s: write index, changed mask = 28
unpack-trees.c:1305     performance: 0.137108835 s: unpack_trees
diff-lib.c:506          performance: 0.137152238 s: diff-index
trace.c:420             performance: 0.972682413 s: git command: 'git' 'checkout'

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 I mentioned about this some time ago and finally got curious enough
 to try out. The saving is in the signficant range in my opinion, but
 real world effect is probably lower (or much higher if you have so
 many patterns in $GIT_DIR/info/sparse-checkout)

 Note that both cache_tree_update and sparse checkout loop #1 are part
 of unpack_trees() so actual time spent on this function is more like
 0.4s. It's still a lot, but then this function is very scary to
 optimize.

 Documentation/technical/index-format.txt |  6 +++++
 cache.h                                  |  2 ++
 read-cache.c                             | 22 ++++++++++++++++-
 unpack-trees.c                           | 42 ++++++++++++++++++++++++++++++--
 4 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index ade0b0c..3b0770a 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -295,3 +295,9 @@ The remaining data of each directory block is grouped by type:
     in the previous ewah bitmap.
 
   - One NUL.
+
+== Sparse checkout cache
+
+  Sparse checkout extension saves the 20 bytes SHA-1 hash of
+  $GIT_DIR/info/sparse-checkout at the time it is applied to the
+  index.
diff --git a/cache.h b/cache.h
index f1dc289..cc4c2b1 100644
--- a/cache.h
+++ b/cache.h
@@ -320,6 +320,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 #define CACHE_TREE_CHANGED	(1 << 5)
 #define SPLIT_INDEX_ORDERED	(1 << 6)
 #define UNTRACKED_CHANGED	(1 << 7)
+#define SPARSE_CHECKOUT_CHANGED	(1 << 8)
 
 struct split_index;
 struct untracked_cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct hashmap dir_hash;
 	unsigned char sha1[20];
 	struct untracked_cache *untracked;
+	unsigned char sparse_checkout[20];
 };
 
 extern struct index_state the_index;
diff --git a/read-cache.c b/read-cache.c
index db27766..371d2c7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -40,11 +40,13 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce,
 #define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
 #define CACHE_EXT_LINK 0x6c696e6b	  /* "link" */
 #define CACHE_EXT_UNTRACKED 0x554E5452	  /* "UNTR" */
+#define CACHE_EXT_SPARSE 0x5350434F	  /* "SPCO" */
 
 /* changes that can be kept in $GIT_DIR/index (basically all extensions) */
 #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \
 		 CE_ENTRY_ADDED | CE_ENTRY_REMOVED | CE_ENTRY_CHANGED | \
-		 SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED)
+		 SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED | \
+		 SPARSE_CHECKOUT_CHANGED)
 
 struct index_state the_index;
 static const char *alternate_index_output;
@@ -1384,6 +1386,11 @@ static int read_index_extension(struct index_state *istate,
 	case CACHE_EXT_UNTRACKED:
 		istate->untracked = read_untracked_extension(data, sz);
 		break;
+	case CACHE_EXT_SPARSE:
+		if (sz != sizeof(istate->sparse_checkout))
+			return error("bad %.4s extension", ext);
+		hashcpy(istate->sparse_checkout, data);
+		break;
 	default:
 		if (*ext < 'A' || 'Z' < *ext)
 			return error("index uses %.4s extension, which we do not understand",
@@ -1704,6 +1711,7 @@ int discard_index(struct index_state *istate)
 	discard_split_index(istate);
 	free_untracked_cache(istate->untracked);
 	istate->untracked = NULL;
+	hashclr(&istate->sparse_checkout);
 	return 0;
 }
 
@@ -2101,6 +2109,18 @@ static int do_write_index(struct index_state *istate, int newfd,
 		if (err)
 			return -1;
 	}
+	if (!strip_extensions && !is_null_sha1(istate->sparse_checkout)) {
+		struct strbuf sb = STRBUF_INIT;
+
+		strbuf_add(&sb, istate->sparse_checkout,
+			   sizeof(istate->sparse_checkout));
+		err = write_index_ext_header(&c, newfd, CACHE_EXT_SPARSE,
+					     sb.len) < 0 ||
+			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		strbuf_release(&sb);
+		if (err)
+			return -1;
+	}
 
 	if (ce_flush(&c, newfd, istate->sha1) || fstat(newfd, &st))
 		return -1;
diff --git a/unpack-trees.c b/unpack-trees.c
index 6bc9512..f3916a9 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1080,6 +1080,25 @@ static void mark_new_skip_worktree(struct exclude_list *el,
 		       select_flag, skip_wt_flag, el);
 }
 
+static void get_sparse_checkout_hash(unsigned char *sha1)
+{
+	struct stat st;
+	int fd;
+
+	hashclr(sha1);
+	fd = open(git_path("info/sparse-checkout"), O_RDONLY);
+	if (fd == -1)
+		return;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return;
+	}
+	if (index_fd(sha1, fd, &st, OBJ_BLOB,
+		     git_path("info/sparse-checkout"), 0) < 0)
+		hashclr(sha1);
+	close(fd);
+}
+
 static int verify_absent(const struct cache_entry *,
 			 enum unpack_trees_error_types,
 			 struct unpack_trees_options *);
@@ -1094,6 +1113,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct exclude_list el;
+	unsigned char sparse_checkout_hash[20];
 
 	if (len > MAX_UNPACK_TREES)
 		die("unpack_trees takes at most %d trees", MAX_UNPACK_TREES);
@@ -1131,8 +1151,21 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	/*
 	 * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries
 	 */
-	if (!o->skip_sparse_checkout)
-		mark_new_skip_worktree(o->el, o->src_index, 0, CE_NEW_SKIP_WORKTREE);
+	if (!o->skip_sparse_checkout) {
+		get_sparse_checkout_hash(sparse_checkout_hash);
+
+		if (!is_null_sha1(sparse_checkout_hash) &&
+		    !hashcmp(o->src_index->sparse_checkout, sparse_checkout_hash)) {
+			struct index_state *istate = o->src_index;
+			for (i = 0; i < istate->cache_nr; i++) {
+				struct cache_entry *ce = istate->cache[i];
+				if (ce_skip_worktree(ce))
+					ce->ce_flags |= CE_NEW_SKIP_WORKTREE;
+			}
+		} else
+			mark_new_skip_worktree(o->el, o->src_index,
+					       0, CE_NEW_SKIP_WORKTREE);
+	}
 
 	if (!dfc)
 		dfc = xcalloc(1, cache_entry_size(0));
@@ -1236,6 +1269,11 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 			ret = unpack_failed(o, "Sparse checkout leaves no entry on working directory");
 			goto done;
 		}
+
+		if (o->dst_index && !is_null_sha1(sparse_checkout_hash)) {
+			hashcpy(o->result.sparse_checkout, sparse_checkout_hash);
+			o->result.cache_changed |= SPARSE_CHECKOUT_CHANGED;
+		}
 	}
 
 	o->src_index = NULL;
-- 
2.8.2.537.g0965dd9


             reply	other threads:[~2016-07-11 18:15 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-11 18:15 Nguyễn Thái Ngọc Duy [this message]
2016-08-13  8:37 ` [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged Duy Nguyen
2016-08-24 21:17   ` Ben Peart

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160711181532.20682-1-pclouds@gmail.com \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).