git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged
@ 2016-07-11 18:15 Nguyễn Thái Ngọc Duy
  2016-08-13  8:37 ` Duy Nguyen
  0 siblings, 1 reply; 3+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2016-07-11 18:15 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

When a "tree unpacking" operation is needed, which is part of
switching branches using "git checkout", the following happens in a
sparse checkout:

1) Run all existing entries through $GIT_DIR/info/sparse-checkout,
   mark entries that are to-be-excluded or to-be-included.

2) Do n-way merge stuff to add, modify and delete entries.

3) Run all new entries added at step 2 through
   $GIT_DIR/info/sparse-checkout, mark entries that are to-be-excluded
   or to-be-included.

4) Compare the current excluded/include status with the to-be-status
   from steps 1 and 3, delete newly excluded entries from worktree and
   add newly included ones to worktree.

The "all existing entries" number in step 1 does not scale well. As
worktree gets bigger (or more sparse patterns are added), step 1 runs
slower. Which does not help because large worktrees are the reason
sparse-checkout is used, to keep the real worktree small again.

If we know that $GIT_DIR/info/sparse-checkout has not changed, we know
that running checking again would result in the exact same
included/excluded as recorded in the current index because
"sparse-checkout" is the only input to the exclude machinery. In this
case, marking the to-be-status is simply copying the current status
over, which is a lot faster.

The time breakdown of "git checkout" (no arguments) on webkit.git
(100k files) with a sparse checkout file of 4 negative patterns is
like this, where "sparse checkout loop #1" takes about 10% execution
time, which is the time saved after this patch.

read-cache.c:1661       performance: 0.057816104 s: read cache .git/index
files-backend.c:1097    performance: 0.000023980 s: read packed refs
preload-index.c:104     performance: 0.039178367 s: preload index
read-cache.c:1260       performance: 0.002700730 s: refresh index
name-hash.c:128         performance: 0.030409968 s: initialize name hash

unpack-trees.c:1173     performance: 0.100353572 s: sparse checkout loop #1

cache-tree.c:431        performance: 0.137213472 s: cache_tree_update
unpack-trees.c:1305     performance: 0.648923590 s: unpack_trees
read-cache.c:2139       performance: 0.074800165 s: write index, changed mask = 28
unpack-trees.c:1305     performance: 0.137108835 s: unpack_trees
diff-lib.c:506          performance: 0.137152238 s: diff-index
trace.c:420             performance: 0.972682413 s: git command: 'git' 'checkout'

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 I mentioned about this some time ago and finally got curious enough
 to try out. The saving is in the signficant range in my opinion, but
 real world effect is probably lower (or much higher if you have so
 many patterns in $GIT_DIR/info/sparse-checkout)

 Note that both cache_tree_update and sparse checkout loop #1 are part
 of unpack_trees() so actual time spent on this function is more like
 0.4s. It's still a lot, but then this function is very scary to
 optimize.

 Documentation/technical/index-format.txt |  6 +++++
 cache.h                                  |  2 ++
 read-cache.c                             | 22 ++++++++++++++++-
 unpack-trees.c                           | 42 ++++++++++++++++++++++++++++++--
 4 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index ade0b0c..3b0770a 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -295,3 +295,9 @@ The remaining data of each directory block is grouped by type:
     in the previous ewah bitmap.
 
   - One NUL.
+
+== Sparse checkout cache
+
+  Sparse checkout extension saves the 20 bytes SHA-1 hash of
+  $GIT_DIR/info/sparse-checkout at the time it is applied to the
+  index.
diff --git a/cache.h b/cache.h
index f1dc289..cc4c2b1 100644
--- a/cache.h
+++ b/cache.h
@@ -320,6 +320,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 #define CACHE_TREE_CHANGED	(1 << 5)
 #define SPLIT_INDEX_ORDERED	(1 << 6)
 #define UNTRACKED_CHANGED	(1 << 7)
+#define SPARSE_CHECKOUT_CHANGED	(1 << 8)
 
 struct split_index;
 struct untracked_cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct hashmap dir_hash;
 	unsigned char sha1[20];
 	struct untracked_cache *untracked;
+	unsigned char sparse_checkout[20];
 };
 
 extern struct index_state the_index;
diff --git a/read-cache.c b/read-cache.c
index db27766..371d2c7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -40,11 +40,13 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce,
 #define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
 #define CACHE_EXT_LINK 0x6c696e6b	  /* "link" */
 #define CACHE_EXT_UNTRACKED 0x554E5452	  /* "UNTR" */
+#define CACHE_EXT_SPARSE 0x5350434F	  /* "SPCO" */
 
 /* changes that can be kept in $GIT_DIR/index (basically all extensions) */
 #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \
 		 CE_ENTRY_ADDED | CE_ENTRY_REMOVED | CE_ENTRY_CHANGED | \
-		 SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED)
+		 SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED | \
+		 SPARSE_CHECKOUT_CHANGED)
 
 struct index_state the_index;
 static const char *alternate_index_output;
@@ -1384,6 +1386,11 @@ static int read_index_extension(struct index_state *istate,
 	case CACHE_EXT_UNTRACKED:
 		istate->untracked = read_untracked_extension(data, sz);
 		break;
+	case CACHE_EXT_SPARSE:
+		if (sz != sizeof(istate->sparse_checkout))
+			return error("bad %.4s extension", ext);
+		hashcpy(istate->sparse_checkout, data);
+		break;
 	default:
 		if (*ext < 'A' || 'Z' < *ext)
 			return error("index uses %.4s extension, which we do not understand",
@@ -1704,6 +1711,7 @@ int discard_index(struct index_state *istate)
 	discard_split_index(istate);
 	free_untracked_cache(istate->untracked);
 	istate->untracked = NULL;
+	hashclr(&istate->sparse_checkout);
 	return 0;
 }
 
@@ -2101,6 +2109,18 @@ static int do_write_index(struct index_state *istate, int newfd,
 		if (err)
 			return -1;
 	}
+	if (!strip_extensions && !is_null_sha1(istate->sparse_checkout)) {
+		struct strbuf sb = STRBUF_INIT;
+
+		strbuf_add(&sb, istate->sparse_checkout,
+			   sizeof(istate->sparse_checkout));
+		err = write_index_ext_header(&c, newfd, CACHE_EXT_SPARSE,
+					     sb.len) < 0 ||
+			ce_write(&c, newfd, sb.buf, sb.len) < 0;
+		strbuf_release(&sb);
+		if (err)
+			return -1;
+	}
 
 	if (ce_flush(&c, newfd, istate->sha1) || fstat(newfd, &st))
 		return -1;
diff --git a/unpack-trees.c b/unpack-trees.c
index 6bc9512..f3916a9 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1080,6 +1080,25 @@ static void mark_new_skip_worktree(struct exclude_list *el,
 		       select_flag, skip_wt_flag, el);
 }
 
+static void get_sparse_checkout_hash(unsigned char *sha1)
+{
+	struct stat st;
+	int fd;
+
+	hashclr(sha1);
+	fd = open(git_path("info/sparse-checkout"), O_RDONLY);
+	if (fd == -1)
+		return;
+	if (fstat(fd, &st)) {
+		close(fd);
+		return;
+	}
+	if (index_fd(sha1, fd, &st, OBJ_BLOB,
+		     git_path("info/sparse-checkout"), 0) < 0)
+		hashclr(sha1);
+	close(fd);
+}
+
 static int verify_absent(const struct cache_entry *,
 			 enum unpack_trees_error_types,
 			 struct unpack_trees_options *);
@@ -1094,6 +1113,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct exclude_list el;
+	unsigned char sparse_checkout_hash[20];
 
 	if (len > MAX_UNPACK_TREES)
 		die("unpack_trees takes at most %d trees", MAX_UNPACK_TREES);
@@ -1131,8 +1151,21 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	/*
 	 * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries
 	 */
-	if (!o->skip_sparse_checkout)
-		mark_new_skip_worktree(o->el, o->src_index, 0, CE_NEW_SKIP_WORKTREE);
+	if (!o->skip_sparse_checkout) {
+		get_sparse_checkout_hash(sparse_checkout_hash);
+
+		if (!is_null_sha1(sparse_checkout_hash) &&
+		    !hashcmp(o->src_index->sparse_checkout, sparse_checkout_hash)) {
+			struct index_state *istate = o->src_index;
+			for (i = 0; i < istate->cache_nr; i++) {
+				struct cache_entry *ce = istate->cache[i];
+				if (ce_skip_worktree(ce))
+					ce->ce_flags |= CE_NEW_SKIP_WORKTREE;
+			}
+		} else
+			mark_new_skip_worktree(o->el, o->src_index,
+					       0, CE_NEW_SKIP_WORKTREE);
+	}
 
 	if (!dfc)
 		dfc = xcalloc(1, cache_entry_size(0));
@@ -1236,6 +1269,11 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 			ret = unpack_failed(o, "Sparse checkout leaves no entry on working directory");
 			goto done;
 		}
+
+		if (o->dst_index && !is_null_sha1(sparse_checkout_hash)) {
+			hashcpy(o->result.sparse_checkout, sparse_checkout_hash);
+			o->result.cache_changed |= SPARSE_CHECKOUT_CHANGED;
+		}
 	}
 
 	o->src_index = NULL;
-- 
2.8.2.537.g0965dd9


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged
  2016-07-11 18:15 [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged Nguyễn Thái Ngọc Duy
@ 2016-08-13  8:37 ` Duy Nguyen
  2016-08-24 21:17   ` Ben Peart
  0 siblings, 1 reply; 3+ messages in thread
From: Duy Nguyen @ 2016-08-13  8:37 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano; +Cc: Nguyễn Thái Ngọc Duy

Ping..

On Tue, Jul 12, 2016 at 1:15 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> When a "tree unpacking" operation is needed, which is part of
> switching branches using "git checkout", the following happens in a
> sparse checkout:
>
> 1) Run all existing entries through $GIT_DIR/info/sparse-checkout,
>    mark entries that are to-be-excluded or to-be-included.
>
> 2) Do n-way merge stuff to add, modify and delete entries.
>
> 3) Run all new entries added at step 2 through
>    $GIT_DIR/info/sparse-checkout, mark entries that are to-be-excluded
>    or to-be-included.
>
> 4) Compare the current excluded/include status with the to-be-status
>    from steps 1 and 3, delete newly excluded entries from worktree and
>    add newly included ones to worktree.
>
> The "all existing entries" number in step 1 does not scale well. As
> worktree gets bigger (or more sparse patterns are added), step 1 runs
> slower. Which does not help because large worktrees are the reason
> sparse-checkout is used, to keep the real worktree small again.
>
> If we know that $GIT_DIR/info/sparse-checkout has not changed, we know
> that running checking again would result in the exact same
> included/excluded as recorded in the current index because
> "sparse-checkout" is the only input to the exclude machinery. In this
> case, marking the to-be-status is simply copying the current status
> over, which is a lot faster.
>
> The time breakdown of "git checkout" (no arguments) on webkit.git
> (100k files) with a sparse checkout file of 4 negative patterns is
> like this, where "sparse checkout loop #1" takes about 10% execution
> time, which is the time saved after this patch.
>
> read-cache.c:1661       performance: 0.057816104 s: read cache .git/index
> files-backend.c:1097    performance: 0.000023980 s: read packed refs
> preload-index.c:104     performance: 0.039178367 s: preload index
> read-cache.c:1260       performance: 0.002700730 s: refresh index
> name-hash.c:128         performance: 0.030409968 s: initialize name hash
>
> unpack-trees.c:1173     performance: 0.100353572 s: sparse checkout loop #1
>
> cache-tree.c:431        performance: 0.137213472 s: cache_tree_update
> unpack-trees.c:1305     performance: 0.648923590 s: unpack_trees
> read-cache.c:2139       performance: 0.074800165 s: write index, changed mask = 28
> unpack-trees.c:1305     performance: 0.137108835 s: unpack_trees
> diff-lib.c:506          performance: 0.137152238 s: diff-index
> trace.c:420             performance: 0.972682413 s: git command: 'git' 'checkout'
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  I mentioned about this some time ago and finally got curious enough
>  to try out. The saving is in the signficant range in my opinion, but
>  real world effect is probably lower (or much higher if you have so
>  many patterns in $GIT_DIR/info/sparse-checkout)
>
>  Note that both cache_tree_update and sparse checkout loop #1 are part
>  of unpack_trees() so actual time spent on this function is more like
>  0.4s. It's still a lot, but then this function is very scary to
>  optimize.
>
>  Documentation/technical/index-format.txt |  6 +++++
>  cache.h                                  |  2 ++
>  read-cache.c                             | 22 ++++++++++++++++-
>  unpack-trees.c                           | 42 ++++++++++++++++++++++++++++++--
>  4 files changed, 69 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index ade0b0c..3b0770a 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -295,3 +295,9 @@ The remaining data of each directory block is grouped by type:
>      in the previous ewah bitmap.
>
>    - One NUL.
> +
> +== Sparse checkout cache
> +
> +  Sparse checkout extension saves the 20 bytes SHA-1 hash of
> +  $GIT_DIR/info/sparse-checkout at the time it is applied to the
> +  index.
> diff --git a/cache.h b/cache.h
> index f1dc289..cc4c2b1 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -320,6 +320,7 @@ static inline unsigned int canon_mode(unsigned int mode)
>  #define CACHE_TREE_CHANGED     (1 << 5)
>  #define SPLIT_INDEX_ORDERED    (1 << 6)
>  #define UNTRACKED_CHANGED      (1 << 7)
> +#define SPARSE_CHECKOUT_CHANGED        (1 << 8)
>
>  struct split_index;
>  struct untracked_cache;
> @@ -338,6 +339,7 @@ struct index_state {
>         struct hashmap dir_hash;
>         unsigned char sha1[20];
>         struct untracked_cache *untracked;
> +       unsigned char sparse_checkout[20];
>  };
>
>  extern struct index_state the_index;
> diff --git a/read-cache.c b/read-cache.c
> index db27766..371d2c7 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -40,11 +40,13 @@ static struct cache_entry *refresh_cache_entry(struct cache_entry *ce,
>  #define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
>  #define CACHE_EXT_LINK 0x6c696e6b        /* "link" */
>  #define CACHE_EXT_UNTRACKED 0x554E5452   /* "UNTR" */
> +#define CACHE_EXT_SPARSE 0x5350434F      /* "SPCO" */
>
>  /* changes that can be kept in $GIT_DIR/index (basically all extensions) */
>  #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \
>                  CE_ENTRY_ADDED | CE_ENTRY_REMOVED | CE_ENTRY_CHANGED | \
> -                SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED)
> +                SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED | \
> +                SPARSE_CHECKOUT_CHANGED)
>
>  struct index_state the_index;
>  static const char *alternate_index_output;
> @@ -1384,6 +1386,11 @@ static int read_index_extension(struct index_state *istate,
>         case CACHE_EXT_UNTRACKED:
>                 istate->untracked = read_untracked_extension(data, sz);
>                 break;
> +       case CACHE_EXT_SPARSE:
> +               if (sz != sizeof(istate->sparse_checkout))
> +                       return error("bad %.4s extension", ext);
> +               hashcpy(istate->sparse_checkout, data);
> +               break;
>         default:
>                 if (*ext < 'A' || 'Z' < *ext)
>                         return error("index uses %.4s extension, which we do not understand",
> @@ -1704,6 +1711,7 @@ int discard_index(struct index_state *istate)
>         discard_split_index(istate);
>         free_untracked_cache(istate->untracked);
>         istate->untracked = NULL;
> +       hashclr(&istate->sparse_checkout);
>         return 0;
>  }
>
> @@ -2101,6 +2109,18 @@ static int do_write_index(struct index_state *istate, int newfd,
>                 if (err)
>                         return -1;
>         }
> +       if (!strip_extensions && !is_null_sha1(istate->sparse_checkout)) {
> +               struct strbuf sb = STRBUF_INIT;
> +
> +               strbuf_add(&sb, istate->sparse_checkout,
> +                          sizeof(istate->sparse_checkout));
> +               err = write_index_ext_header(&c, newfd, CACHE_EXT_SPARSE,
> +                                            sb.len) < 0 ||
> +                       ce_write(&c, newfd, sb.buf, sb.len) < 0;
> +               strbuf_release(&sb);
> +               if (err)
> +                       return -1;
> +       }
>
>         if (ce_flush(&c, newfd, istate->sha1) || fstat(newfd, &st))
>                 return -1;
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 6bc9512..f3916a9 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1080,6 +1080,25 @@ static void mark_new_skip_worktree(struct exclude_list *el,
>                        select_flag, skip_wt_flag, el);
>  }
>
> +static void get_sparse_checkout_hash(unsigned char *sha1)
> +{
> +       struct stat st;
> +       int fd;
> +
> +       hashclr(sha1);
> +       fd = open(git_path("info/sparse-checkout"), O_RDONLY);
> +       if (fd == -1)
> +               return;
> +       if (fstat(fd, &st)) {
> +               close(fd);
> +               return;
> +       }
> +       if (index_fd(sha1, fd, &st, OBJ_BLOB,
> +                    git_path("info/sparse-checkout"), 0) < 0)
> +               hashclr(sha1);
> +       close(fd);
> +}
> +
>  static int verify_absent(const struct cache_entry *,
>                          enum unpack_trees_error_types,
>                          struct unpack_trees_options *);
> @@ -1094,6 +1113,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>         int i, ret;
>         static struct cache_entry *dfc;
>         struct exclude_list el;
> +       unsigned char sparse_checkout_hash[20];
>
>         if (len > MAX_UNPACK_TREES)
>                 die("unpack_trees takes at most %d trees", MAX_UNPACK_TREES);
> @@ -1131,8 +1151,21 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>         /*
>          * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries
>          */
> -       if (!o->skip_sparse_checkout)
> -               mark_new_skip_worktree(o->el, o->src_index, 0, CE_NEW_SKIP_WORKTREE);
> +       if (!o->skip_sparse_checkout) {
> +               get_sparse_checkout_hash(sparse_checkout_hash);
> +
> +               if (!is_null_sha1(sparse_checkout_hash) &&
> +                   !hashcmp(o->src_index->sparse_checkout, sparse_checkout_hash)) {
> +                       struct index_state *istate = o->src_index;
> +                       for (i = 0; i < istate->cache_nr; i++) {
> +                               struct cache_entry *ce = istate->cache[i];
> +                               if (ce_skip_worktree(ce))
> +                                       ce->ce_flags |= CE_NEW_SKIP_WORKTREE;
> +                       }
> +               } else
> +                       mark_new_skip_worktree(o->el, o->src_index,
> +                                              0, CE_NEW_SKIP_WORKTREE);
> +       }
>
>         if (!dfc)
>                 dfc = xcalloc(1, cache_entry_size(0));
> @@ -1236,6 +1269,11 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>                         ret = unpack_failed(o, "Sparse checkout leaves no entry on working directory");
>                         goto done;
>                 }
> +
> +               if (o->dst_index && !is_null_sha1(sparse_checkout_hash)) {
> +                       hashcpy(o->result.sparse_checkout, sparse_checkout_hash);
> +                       o->result.cache_changed |= SPARSE_CHECKOUT_CHANGED;
> +               }
>         }
>
>         o->src_index = NULL;
> --
> 2.8.2.537.g0965dd9
>



-- 
Duy

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged
  2016-08-13  8:37 ` Duy Nguyen
@ 2016-08-24 21:17   ` Ben Peart
  0 siblings, 0 replies; 3+ messages in thread
From: Ben Peart @ 2016-08-24 21:17 UTC (permalink / raw)
  To: 'Duy Nguyen', 'Git Mailing List',
	'Junio C Hamano'

I didn't apply the patch and test it out but from just a code review, the logic behind and the design of this patch makes sense and it's a relatively small patch for the gain.  It's also a great example of the minimal amount of work required to add a new extension into the index.  Thank you for that.

For more extreme cases where there are large numbers of entries in the sparse-checkout file, another approach may be more helpful.  Our virtualization solution uses sparse checkout extensively.  The sparse-checkout file starts out empty and as files are hydrated locally, they are added to the sparse-checkout file.  This results in thousands of entries in the sparse-checkout file.   

To make that fast enough, we ended up creating a hashmap in mark_new_skip_worktree and then in clear_ce_flags_1 we use that hashmap to implement the pattern matching logic.  The hashmap lookup is dramatically faster than the current recursive and complex pattern matching via the excludes mechanism so easily supports very large numbers of entries.  

Note, these are extreme test cases that caused the unmodified git.exe to crash so I can't really give before/after comparisons:

With 35,000 entries in the sparse-checkout file, "git checkout -b xxx" took 42.5 seconds
With 100,000 entries in the sparse-checkout file, "git checkout -b xxx" took 42.5 seconds
With 3,279,254 entries in the sparse-checkout file, "git checkout -b xxx" took 1 min 38 seconds

Note the first two numbers are the same because the hashmap based pattern matching is so fast, 95% of the time is now spent in merge_working_tree.  Since the tip commit doesn’t change between the new and old branch and because a merge is done, the files in the working directory don’t change either so I believe there must be an optimization there but I haven't been able to figure one out that doesn't break a lot of other things.  

Ben

p.s.  Yes, I actually ran "git ls-files > .git/info/sparse-checkout" to test the extreme case. :)

-----Original Message-----
From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On Behalf Of Duy Nguyen
Sent: Saturday, August 13, 2016 4:37 AM
To: Git Mailing List <git@vger.kernel.org>; Junio C Hamano <gitster@pobox.com>
Cc: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Subject: Re: [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged

Ping..

On Tue, Jul 12, 2016 at 1:15 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> When a "tree unpacking" operation is needed, which is part of 
> switching branches using "git checkout", the following happens in a 
> sparse checkout:
>
> 1) Run all existing entries through $GIT_DIR/info/sparse-checkout,
>    mark entries that are to-be-excluded or to-be-included.
>
> 2) Do n-way merge stuff to add, modify and delete entries.
>
> 3) Run all new entries added at step 2 through
>    $GIT_DIR/info/sparse-checkout, mark entries that are to-be-excluded
>    or to-be-included.
>
> 4) Compare the current excluded/include status with the to-be-status
>    from steps 1 and 3, delete newly excluded entries from worktree and
>    add newly included ones to worktree.
>
> The "all existing entries" number in step 1 does not scale well. As 
> worktree gets bigger (or more sparse patterns are added), step 1 runs 
> slower. Which does not help because large worktrees are the reason 
> sparse-checkout is used, to keep the real worktree small again.
>
> If we know that $GIT_DIR/info/sparse-checkout has not changed, we know 
> that running checking again would result in the exact same 
> included/excluded as recorded in the current index because 
> "sparse-checkout" is the only input to the exclude machinery. In this 
> case, marking the to-be-status is simply copying the current status 
> over, which is a lot faster.
>
> The time breakdown of "git checkout" (no arguments) on webkit.git 
> (100k files) with a sparse checkout file of 4 negative patterns is 
> like this, where "sparse checkout loop #1" takes about 10% execution 
> time, which is the time saved after this patch.
>
> read-cache.c:1661       performance: 0.057816104 s: read cache .git/index
> files-backend.c:1097    performance: 0.000023980 s: read packed refs
> preload-index.c:104     performance: 0.039178367 s: preload index
> read-cache.c:1260       performance: 0.002700730 s: refresh index
> name-hash.c:128         performance: 0.030409968 s: initialize name hash
>
> unpack-trees.c:1173     performance: 0.100353572 s: sparse checkout loop #1
>
> cache-tree.c:431        performance: 0.137213472 s: cache_tree_update
> unpack-trees.c:1305     performance: 0.648923590 s: unpack_trees
> read-cache.c:2139       performance: 0.074800165 s: write index, changed mask = 28
> unpack-trees.c:1305     performance: 0.137108835 s: unpack_trees
> diff-lib.c:506          performance: 0.137152238 s: diff-index
> trace.c:420             performance: 0.972682413 s: git command: 'git' 'checkout'
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  I mentioned about this some time ago and finally got curious enough  
> to try out. The saving is in the signficant range in my opinion, but  
> real world effect is probably lower (or much higher if you have so  
> many patterns in $GIT_DIR/info/sparse-checkout)
>
>  Note that both cache_tree_update and sparse checkout loop #1 are part  
> of unpack_trees() so actual time spent on this function is more like  
> 0.4s. It's still a lot, but then this function is very scary to  
> optimize.
>
>  Documentation/technical/index-format.txt |  6 +++++
>  cache.h                                  |  2 ++
>  read-cache.c                             | 22 ++++++++++++++++-
>  unpack-trees.c                           | 42 ++++++++++++++++++++++++++++++--
>  4 files changed, 69 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/technical/index-format.txt 
> b/Documentation/technical/index-format.txt
> index ade0b0c..3b0770a 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -295,3 +295,9 @@ The remaining data of each directory block is grouped by type:
>      in the previous ewah bitmap.
>
>    - One NUL.
> +
> +== Sparse checkout cache
> +
> +  Sparse checkout extension saves the 20 bytes SHA-1 hash of  
> + $GIT_DIR/info/sparse-checkout at the time it is applied to the  
> + index.
> diff --git a/cache.h b/cache.h
> index f1dc289..cc4c2b1 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -320,6 +320,7 @@ static inline unsigned int canon_mode(unsigned int mode)
>  #define CACHE_TREE_CHANGED     (1 << 5)
>  #define SPLIT_INDEX_ORDERED    (1 << 6)
>  #define UNTRACKED_CHANGED      (1 << 7)
> +#define SPARSE_CHECKOUT_CHANGED        (1 << 8)
>
>  struct split_index;
>  struct untracked_cache;
> @@ -338,6 +339,7 @@ struct index_state {
>         struct hashmap dir_hash;
>         unsigned char sha1[20];
>         struct untracked_cache *untracked;
> +       unsigned char sparse_checkout[20];
>  };
>
>  extern struct index_state the_index;
> diff --git a/read-cache.c b/read-cache.c index db27766..371d2c7 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -40,11 +40,13 @@ static struct cache_entry 
> *refresh_cache_entry(struct cache_entry *ce,  #define CACHE_EXT_RESOLVE_UNDO 0x52455543 /* "REUC" */
>  #define CACHE_EXT_LINK 0x6c696e6b        /* "link" */
>  #define CACHE_EXT_UNTRACKED 0x554E5452   /* "UNTR" */
> +#define CACHE_EXT_SPARSE 0x5350434F      /* "SPCO" */
>
>  /* changes that can be kept in $GIT_DIR/index (basically all 
> extensions) */  #define EXTMASK (RESOLVE_UNDO_CHANGED | CACHE_TREE_CHANGED | \
>                  CE_ENTRY_ADDED | CE_ENTRY_REMOVED | CE_ENTRY_CHANGED | \
> -                SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED)
> +                SPLIT_INDEX_ORDERED | UNTRACKED_CHANGED | \
> +                SPARSE_CHECKOUT_CHANGED)
>
>  struct index_state the_index;
>  static const char *alternate_index_output; @@ -1384,6 +1386,11 @@ 
> static int read_index_extension(struct index_state *istate,
>         case CACHE_EXT_UNTRACKED:
>                 istate->untracked = read_untracked_extension(data, sz);
>                 break;
> +       case CACHE_EXT_SPARSE:
> +               if (sz != sizeof(istate->sparse_checkout))
> +                       return error("bad %.4s extension", ext);
> +               hashcpy(istate->sparse_checkout, data);
> +               break;
>         default:
>                 if (*ext < 'A' || 'Z' < *ext)
>                         return error("index uses %.4s extension, which 
> we do not understand", @@ -1704,6 +1711,7 @@ int discard_index(struct index_state *istate)
>         discard_split_index(istate);
>         free_untracked_cache(istate->untracked);
>         istate->untracked = NULL;
> +       hashclr(&istate->sparse_checkout);
>         return 0;
>  }
>
> @@ -2101,6 +2109,18 @@ static int do_write_index(struct index_state *istate, int newfd,
>                 if (err)
>                         return -1;
>         }
> +       if (!strip_extensions && !is_null_sha1(istate->sparse_checkout)) {
> +               struct strbuf sb = STRBUF_INIT;
> +
> +               strbuf_add(&sb, istate->sparse_checkout,
> +                          sizeof(istate->sparse_checkout));
> +               err = write_index_ext_header(&c, newfd, CACHE_EXT_SPARSE,
> +                                            sb.len) < 0 ||
> +                       ce_write(&c, newfd, sb.buf, sb.len) < 0;
> +               strbuf_release(&sb);
> +               if (err)
> +                       return -1;
> +       }
>
>         if (ce_flush(&c, newfd, istate->sha1) || fstat(newfd, &st))
>                 return -1;
> diff --git a/unpack-trees.c b/unpack-trees.c index 6bc9512..f3916a9 
> 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1080,6 +1080,25 @@ static void mark_new_skip_worktree(struct exclude_list *el,
>                        select_flag, skip_wt_flag, el);  }
>
> +static void get_sparse_checkout_hash(unsigned char *sha1) {
> +       struct stat st;
> +       int fd;
> +
> +       hashclr(sha1);
> +       fd = open(git_path("info/sparse-checkout"), O_RDONLY);
> +       if (fd == -1)
> +               return;
> +       if (fstat(fd, &st)) {
> +               close(fd);
> +               return;
> +       }
> +       if (index_fd(sha1, fd, &st, OBJ_BLOB,
> +                    git_path("info/sparse-checkout"), 0) < 0)
> +               hashclr(sha1);
> +       close(fd);
> +}
> +
>  static int verify_absent(const struct cache_entry *,
>                          enum unpack_trees_error_types,
>                          struct unpack_trees_options *); @@ -1094,6 
> +1113,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>         int i, ret;
>         static struct cache_entry *dfc;
>         struct exclude_list el;
> +       unsigned char sparse_checkout_hash[20];
>
>         if (len > MAX_UNPACK_TREES)
>                 die("unpack_trees takes at most %d trees", 
> MAX_UNPACK_TREES); @@ -1131,8 +1151,21 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>         /*
>          * Sparse checkout loop #1: set NEW_SKIP_WORKTREE on existing entries
>          */
> -       if (!o->skip_sparse_checkout)
> -               mark_new_skip_worktree(o->el, o->src_index, 0, CE_NEW_SKIP_WORKTREE);
> +       if (!o->skip_sparse_checkout) {
> +               get_sparse_checkout_hash(sparse_checkout_hash);
> +
> +               if (!is_null_sha1(sparse_checkout_hash) &&
> +                   !hashcmp(o->src_index->sparse_checkout, sparse_checkout_hash)) {
> +                       struct index_state *istate = o->src_index;
> +                       for (i = 0; i < istate->cache_nr; i++) {
> +                               struct cache_entry *ce = istate->cache[i];
> +                               if (ce_skip_worktree(ce))
> +                                       ce->ce_flags |= CE_NEW_SKIP_WORKTREE;
> +                       }
> +               } else
> +                       mark_new_skip_worktree(o->el, o->src_index,
> +                                              0, CE_NEW_SKIP_WORKTREE);
> +       }
>
>         if (!dfc)
>                 dfc = xcalloc(1, cache_entry_size(0)); @@ -1236,6 
> +1269,11 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>                         ret = unpack_failed(o, "Sparse checkout leaves no entry on working directory");
>                         goto done;
>                 }
> +
> +               if (o->dst_index && !is_null_sha1(sparse_checkout_hash)) {
> +                       hashcpy(o->result.sparse_checkout, sparse_checkout_hash);
> +                       o->result.cache_changed |= SPARSE_CHECKOUT_CHANGED;
> +               }
>         }
>
>         o->src_index = NULL;
> --
> 2.8.2.537.g0965dd9
>



--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-08-24 21:17 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-11 18:15 [PATCH] Speed up sparse checkout when $GIT_DIR/info/sparse-checkout is unchanged Nguyễn Thái Ngọc Duy
2016-08-13  8:37 ` Duy Nguyen
2016-08-24 21:17   ` Ben Peart

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).