git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v1] checkout: optionally speed up "git checkout -b foo"
@ 2018-07-24 18:01 Ben Peart
  2018-07-24 18:42 ` Eric Sunshine
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Ben Peart @ 2018-07-24 18:01 UTC (permalink / raw)
  To: git@vger.kernel.org; +Cc: gitster@pobox.com, Ben Peart

From: Ben Peart <Ben.Peart@microsoft.com>

If the new core.optimizecheckout config setting is set to true, speed up
"git checkout -b foo" by avoiding the work to merge the working tree.  This
is valid because no merge needs to occur - only creating the new branch/
updating the refs. Any other options force it through the old code path.

This change in behavior is off by default and behind the config setting so
that users have to opt-in to the optimized behavior.

We've been running with this patch internally for a long time but it was
rejected when I submitted it to the mailing list before because it
implicitly changes the behavior of checkout -b. Trying it again configured
behind a config setting as a potential solution for other optimizations to
checkout that could change the behavior as well.

https://public-inbox.org/git/20180724042740.GB13248@sigill.intra.peff.net/T/#m75afe3ab318d23f36334cf3a6e3d058839592469

Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
---

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/f43d934ce7
    Checkout: git fetch https://github.com/benpeart/git checkout-b-v1 && git checkout f43d934ce7

 Documentation/config.txt |  6 +++
 builtin/checkout.c       | 94 ++++++++++++++++++++++++++++++++++++++++
 cache.h                  |  1 +
 config.c                 |  5 +++
 environment.c            |  1 +
 5 files changed, 107 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index a32172a43c..2c4f513bf1 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -911,6 +911,12 @@ core.commitGraph::
 	Enable git commit graph feature. Allows reading from the
 	commit-graph file.
 
+core.optimizedCheckout
+	Speed up "git checkout -b foo" by skipping much of the work of a
+	full checkout command.  This changs the behavior as it will skip
+	merging the trees and updating the index and instead only create
+	and switch to the new ref.
+
 core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
diff --git a/builtin/checkout.c b/builtin/checkout.c
index 28627650cd..b186a3201e 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -41,6 +41,10 @@ struct checkout_opts {
 	int ignore_skipworktree;
 	int ignore_other_worktrees;
 	int show_progress;
+	/*
+	 * If new checkout options are added, needs_working_tree_merge
+	 * should be updated accordingly.
+	 */
 
 	const char *new_branch;
 	const char *new_branch_force;
@@ -471,6 +475,88 @@ static void setup_branch_path(struct branch_info *branch)
 	branch->path = strbuf_detach(&buf, NULL);
 }
 
+static int needs_working_tree_merge(const struct checkout_opts *opts,
+	const struct branch_info *old_branch_info,
+	const struct branch_info *new_branch_info)
+{
+	/*
+	 * We must do the merge if we are actually moving to a new
+	 * commit tree.
+	 */
+	if (!old_branch_info->commit || !new_branch_info->commit ||
+		oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
+		return 1;
+
+	/*
+	 * opts->patch_mode cannot be used with switching branches so is
+	 * not tested here
+	 */
+
+	/*
+	 * opts->quiet only impacts output so doesn't require a merge
+	 */
+
+	/*
+	 * Honor the explicit request for a three-way merge or to throw away
+	 * local changes
+	 */
+	if (opts->merge || opts->force)
+		return 1;
+
+	/*
+	 * --detach is documented as "updating the index and the files in the
+	 * working tree" but this optimization skips those steps so fall through
+	 * to the regular code path.
+	 */
+	if (opts->force_detach)
+		return 1;
+
+	/*
+	 * opts->writeout_stage cannot be used with switching branches so is
+	 * not tested here
+	 */
+
+	/*
+	 * Honor the explicit ignore requests
+	 */
+	if (!opts->overwrite_ignore || opts->ignore_skipworktree ||
+		opts->ignore_other_worktrees)
+		return 1;
+
+	/*
+	 * opts->show_progress only impacts output so doesn't require a merge
+	 */
+
+	/*
+	 * If we aren't creating a new branch any changes or updates will
+	 * happen in the existing branch.  Since that could only be updating
+	 * the index and working directory, we don't want to skip those steps
+	 * or we've defeated any purpose in running the command.
+	 */
+	if (!opts->new_branch)
+		return 1;
+
+	/*
+	 * new_branch_force is defined to "create/reset and checkout a branch"
+	 * so needs to go through the merge to do the reset
+	 */
+	if (opts->new_branch_force)
+		return 1;
+
+	/*
+	 * A new orphaned branch requrires the index and the working tree to be
+	 * adjusted to <start_point>
+	 */
+	if (opts->new_orphan_branch)
+		return 1;
+
+	/*
+	 * Remaining variables are not checkout options but used to track state
+	 */
+
+	return 0;
+}
+
 static int merge_working_tree(const struct checkout_opts *opts,
 			      struct branch_info *old_branch_info,
 			      struct branch_info *new_branch_info,
@@ -479,6 +565,14 @@ static int merge_working_tree(const struct checkout_opts *opts,
 	int ret;
 	struct lock_file lock_file = LOCK_INIT;
 
+	/*
+	 * Skip merging the trees, updating the index, and work tree only if we
+	 * are simply creating a new branch via "git checkout -b foo."  Any
+	 * other options or usage will continue to do all these steps.
+	 */
+	if (core_optimize_checkout && !needs_working_tree_merge(opts, old_branch_info, new_branch_info))
+		return 0;
+
 	hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
 	if (read_cache_preload(NULL) < 0)
 		return error(_("index file corrupt"));
diff --git a/cache.h b/cache.h
index 8b447652a7..7f2b386c67 100644
--- a/cache.h
+++ b/cache.h
@@ -815,6 +815,7 @@ extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_commit_graph;
 extern int core_apply_sparse_checkout;
+extern int core_optimize_checkout;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index 7968ef7566..b59a303e52 100644
--- a/config.c
+++ b/config.c
@@ -1319,6 +1319,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.optimizecheckout")) {
+		core_optimize_checkout = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.precomposeunicode")) {
 		precomposed_unicode = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index 013e845235..1991fb4855 100644
--- a/environment.c
+++ b/environment.c
@@ -68,6 +68,7 @@ char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_commit_graph;
 int core_apply_sparse_checkout;
+int core_optimize_checkout;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;

base-commit: b7bd9486b055c3f967a870311e704e3bb0654e4f
-- 
2.17.0.gvfs.1.123.g449c066


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-24 18:01 [PATCH v1] checkout: optionally speed up "git checkout -b foo" Ben Peart
@ 2018-07-24 18:42 ` Eric Sunshine
  2018-07-24 19:45   ` Ben Peart
  2018-07-24 19:21 ` Junio C Hamano
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Eric Sunshine @ 2018-07-24 18:42 UTC (permalink / raw)
  To: Ben Peart; +Cc: Git List, Junio C Hamano

On Tue, Jul 24, 2018 at 2:01 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
> If the new core.optimizecheckout config setting is set to true, speed up

Maybe:

    Add core.optimizeCheckout config setting which, when true, speeds up

> "git checkout -b foo" by avoiding the work to merge the working tree.  This
> is valid because no merge needs to occur - only creating the new branch/
> updating the refs. Any other options force it through the old code path.
>
> This change in behavior is off by default and behind the config setting so
> that users have to opt-in to the optimized behavior.
>
> We've been running with this patch internally for a long time but it was
> rejected when I submitted it to the mailing list before because it
> implicitly changes the behavior of checkout -b. Trying it again configured
> behind a config setting as a potential solution for other optimizations to
> checkout that could change the behavior as well.

This paragraph is mere commentary which probably belongs below the
"---" line following your sign-off.

> https://public-inbox.org/git/20180724042740.GB13248@sigill.intra.peff.net/T/#m75afe3ab318d23f36334cf3a6e3d058839592469

Is this link meant to reference the previous attempt of optimizing
"checkout -b"? Although there's a single mention of "checkout -b" in
that discussion, it doesn't seem to be the previous attempt or explain
why it was rejected.

It would be quite nice to see a discussion in both the commit message
and the documentation about the pros and cons of enabling this
optimization. That it was previously rejected suggests that there may
be serious or unexpected consequences. How will a typical user know
whether its use is desirable or not?

> Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
> ---
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> @@ -911,6 +911,12 @@ core.commitGraph::
> +core.optimizedCheckout
> +       Speed up "git checkout -b foo" by skipping much of the work of a
> +       full checkout command.  This changs the behavior as it will skip

s/changs/changes/

> +       merging the trees and updating the index and instead only create
> +       and switch to the new ref.
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> @@ -471,6 +475,88 @@ static void setup_branch_path(struct branch_info *branch)
> +static int needs_working_tree_merge(const struct checkout_opts *opts,
> +       const struct branch_info *old_branch_info,
> +       const struct branch_info *new_branch_info)
> +{
> +       /*
> +        * We must do the merge if we are actually moving to a new
> +        * commit tree.
> +        */
> +       if (!old_branch_info->commit || !new_branch_info->commit ||
> +               oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
> +               return 1;
> +       [...]
> +       return 0;
> +}

This long list of special-case checks doesn't leave me too enthused,
however, that aside, this approach seems backward. Rather than erring
on the side of safety by falling back to the merging behavior, it errs
in the other direction, which may be a problem if this list of
special-case checks ever gets out of sync with 'checkout_opts'. That
is, if someone adds a new option which ought to employ the merging
behavior, but forgets to update this function, then this function will
incorrectly default to using the optimization.

A safer approach would be the inverse, namely:

    static int skip_worktree_merge(...)
    {
        if (...meets all criteria for optimization...)
            return 1;
        return 0;
    }

>  static int merge_working_tree(const struct checkout_opts *opts,
>                               struct branch_info *old_branch_info,
>                               struct branch_info *new_branch_info,
> {
> +       /*
> +        * Skip merging the trees, updating the index, and work tree only if we
> +        * are simply creating a new branch via "git checkout -b foo."  Any
> +        * other options or usage will continue to do all these steps.
> +        */
> +       if (core_optimize_checkout && !needs_working_tree_merge(opts, old_branch_info, new_branch_info))
> +               return 0;

This seems a somewhat odd place to hook in this optimization,
especially as there is only a single caller of this function. Instead,
one might expect the caller itself to make this judgment and avoid
trying the merge in the first place if not needed. That is, in
switch_branches:

    if (!skip_worktree_merge(...))
        ret = merge_working_tree(...);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-24 18:01 [PATCH v1] checkout: optionally speed up "git checkout -b foo" Ben Peart
  2018-07-24 18:42 ` Eric Sunshine
@ 2018-07-24 19:21 ` Junio C Hamano
  2018-07-24 20:47   ` Ben Peart
  2018-07-31 16:39 ` [PATCH v2] checkout: optimize "git checkout -b <new_branch>" Ben Peart
  2018-08-16 18:27 ` [PATCH v3] " Ben Peart
  3 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-07-24 19:21 UTC (permalink / raw)
  To: Ben Peart; +Cc: git@vger.kernel.org

Ben Peart <Ben.Peart@microsoft.com> writes:

> From: Ben Peart <Ben.Peart@microsoft.com>
>
> If the new core.optimizecheckout config setting is set to true, speed up
> "git checkout -b foo" by avoiding the work to merge the working tree.  This
> is valid because no merge needs to occur - only creating the new branch/
> updating the refs. Any other options force it through the old code path.
>
> This change in behavior is off by default and behind the config setting so
> that users have to opt-in to the optimized behavior.




> We've been running with this patch internally for a long time but it was
> rejected when I submitted it to the mailing list before because it
> implicitly changes the behavior of checkout -b. Trying it again configured
> behind a config setting as a potential solution for other optimizations to
> checkout that could change the behavior as well.
>
> https://public-inbox.org/git/20180724042740.GB13248@sigill.intra.peff.net/T/#m75afe3ab318d23f36334cf3a6e3d058839592469

An incorrect link?  It does not look like a thread that explains
what was previously submitted but failed.  The last paragraph looks
like a fine material below the three-dash line.


> Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
> ---
>
> Notes:
>     Base Ref: master
>     Web-Diff: https://github.com/benpeart/git/commit/f43d934ce7
>     Checkout: git fetch https://github.com/benpeart/git checkout-b-v1 && git checkout f43d934ce7
>
>  Documentation/config.txt |  6 +++
>  builtin/checkout.c       | 94 ++++++++++++++++++++++++++++++++++++++++
>  cache.h                  |  1 +
>  config.c                 |  5 +++
>  environment.c            |  1 +
>  5 files changed, 107 insertions(+)
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index a32172a43c..2c4f513bf1 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -911,6 +911,12 @@ core.commitGraph::
>  	Enable git commit graph feature. Allows reading from the
>  	commit-graph file.
>  
> +core.optimizedCheckout
> +	Speed up "git checkout -b foo" by skipping much of the work of a
> +	full checkout command.  This changs the behavior as it will skip
> +	merging the trees and updating the index and instead only create
> +	and switch to the new ref.

By the way, why is it a core.* thing, not checkout.* thing?

If a new feature is not necessarily recommendable for normal users
and it needs to be hidden behind an opt-in knob (I do not have a
strong opinion if that is or is not the case for this particular
feature at this point), the documentation for the knob should give a
bit more than "This chang(e)s the behavior" to the readers, I would
think, to be intellectually honest ;-).  Let's tell them what bad
things happen if we pretend that we switched the branch without
twoway merge and the index update to help them make an informed
decision.

> +static int needs_working_tree_merge(const struct checkout_opts *opts,
> +	const struct branch_info *old_branch_info,
> +	const struct branch_info *new_branch_info)
> +{
> +	/*
> +	 * We must do the merge if we are actually moving to a new
> +	 * commit tree.

What's a "commit tree"?  Shouldn't it be just a "commit"?

> +	 */
> +	if (!old_branch_info->commit || !new_branch_info->commit ||
> +		oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
> +		return 1;
> +
> +	/*
> +	 * opts->patch_mode cannot be used with switching branches so is
> +	 * not tested here
> +	 */
> +
> +	/*
> +	 * opts->quiet only impacts output so doesn't require a merge
> +	 */
> +
> +	/*
> +	 * Honor the explicit request for a three-way merge or to throw away
> +	 * local changes
> +	 */
> +	if (opts->merge || opts->force)
> +		return 1;
> +
> +	/*
> +	 * --detach is documented as "updating the index and the files in the
> +	 * working tree" but this optimization skips those steps so fall through
> +	 * to the regular code path.
> +	 */
> +	if (opts->force_detach)
> +		return 1;
> +
> +	/*
> +	 * opts->writeout_stage cannot be used with switching branches so is
> +	 * not tested here
> +	 */
> +
> +	/*
> +	 * Honor the explicit ignore requests
> +	 */
> +	if (!opts->overwrite_ignore || opts->ignore_skipworktree ||
> +		opts->ignore_other_worktrees)
> +		return 1;
> +
> +	/*
> +	 * opts->show_progress only impacts output so doesn't require a merge
> +	 */
> +
> +	/*
> +	 * If we aren't creating a new branch any changes or updates will
> +	 * happen in the existing branch.  Since that could only be updating
> +	 * the index and working directory, we don't want to skip those steps
> +	 * or we've defeated any purpose in running the command.
> +	 */
> +	if (!opts->new_branch)
> +		return 1;
> +
> +	/*
> +	 * new_branch_force is defined to "create/reset and checkout a branch"
> +	 * so needs to go through the merge to do the reset
> +	 */
> +	if (opts->new_branch_force)
> +		return 1;
> +
> +	/*
> +	 * A new orphaned branch requrires the index and the working tree to be
> +	 * adjusted to <start_point>
> +	 */
> +	if (opts->new_orphan_branch)
> +		return 1;
> +
> +	/*
> +	 * Remaining variables are not checkout options but used to track state
> +	 */
> +
> +	return 0;
> +}

This helper function alone looks like we are creating a maintenance
nightmare from a quick scan.  How are we going to keep this up to
date?

I offhand do not know how "git checkout -b foo" would behave
differently if we do not do a two-way merge between HEAD and HEAD to
update the index.  We'd still need to list the local modifications
and say "Switched to a new branch 'foo'", but that would be a minor
thing compared to the two-way merge machinery.

Was the primary reason why the patch "changes the behaviour" because
nobody could prove that needs_working_tree_merge() helper reliably
detects that "checkout -b foo" case and that case alone, and show a
way to make sure it will keep doing so in the future when other new
features are added to the command?

> @@ -479,6 +565,14 @@ static int merge_working_tree(const struct checkout_opts *opts,
>  	int ret;
>  	struct lock_file lock_file = LOCK_INIT;
>  
> +	/*
> +	 * Skip merging the trees, updating the index, and work tree only if we
> +	 * are simply creating a new branch via "git checkout -b foo."  Any
> +	 * other options or usage will continue to do all these steps.
> +	 */
> +	if (core_optimize_checkout && !needs_working_tree_merge(opts, old_branch_info, new_branch_info))
> +		return 0;
> +
>  	hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
>  	if (read_cache_preload(NULL) < 0)
>  		return error(_("index file corrupt"));

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-24 18:42 ` Eric Sunshine
@ 2018-07-24 19:45   ` Ben Peart
  2018-07-26 15:04     ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Peart @ 2018-07-24 19:45 UTC (permalink / raw)
  To: Eric Sunshine, Ben Peart; +Cc: Git List, Junio C Hamano



On 7/24/2018 2:42 PM, Eric Sunshine wrote:
> On Tue, Jul 24, 2018 at 2:01 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>> If the new core.optimizecheckout config setting is set to true, speed up
> 
> Maybe:
> 
>      Add core.optimizeCheckout config setting which, when true, speeds up
> 

Sure

>> "git checkout -b foo" by avoiding the work to merge the working tree.  This
>> is valid because no merge needs to occur - only creating the new branch/
>> updating the refs. Any other options force it through the old code path.
>>
>> This change in behavior is off by default and behind the config setting so
>> that users have to opt-in to the optimized behavior.
>>
>> We've been running with this patch internally for a long time but it was
>> rejected when I submitted it to the mailing list before because it
>> implicitly changes the behavior of checkout -b. Trying it again configured
>> behind a config setting as a potential solution for other optimizations to
>> checkout that could change the behavior as well.
> 
> This paragraph is mere commentary which probably belongs below the
> "---" line following your sign-off.
> 

Hopefully this commentary (I'll move it below the --- line) is clearer:

We've been running with this patch internally for a long time but it was
rejected when I submitted it to the mailing list before [1] because it
implicitly changes the behavior of checkout -b as it no longer updates
the working directory.

I'm submitting it again behind a config setting so that it doesn't cause
any back compat issues unless the user explicitly opts in to the new
behavior. My hope is this same setting and model can be used if/when we
make other performance optimizations to checkout like using the cache
tree to avoid having to traverse the entire tree being discussed [2].

[1] 
https://public-inbox.org/git/20160909192520.4812-1-benpeart@microsoft.com/
[2] 
https://public-inbox.org/git/20180724042740.GB13248@sigill.intra.peff.net/T/#m75afe3ab318d23f36334cf3a6e3d058839592469

>> https://public-inbox.org/git/20180724042740.GB13248@sigill.intra.peff.net/T/#m75afe3ab318d23f36334cf3a6e3d058839592469
> 
> Is this link meant to reference the previous attempt of optimizing
> "checkout -b"? Although there's a single mention of "checkout -b" in
> that discussion, it doesn't seem to be the previous attempt or explain
> why it was rejected.
> 
> It would be quite nice to see a discussion in both the commit message
> and the documentation about the pros and cons of enabling this
> optimization. That it was previously rejected suggests that there may
> be serious or unexpected consequences. How will a typical user know
> whether its use is desirable or not?
> 
>> Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
>> ---
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> @@ -911,6 +911,12 @@ core.commitGraph::
>> +core.optimizedCheckout
>> +       Speed up "git checkout -b foo" by skipping much of the work of a
>> +       full checkout command.  This changs the behavior as it will skip
> 
> s/changs/changes/
> 
>> +       merging the trees and updating the index and instead only create
>> +       and switch to the new ref.
>> diff --git a/builtin/checkout.c b/builtin/checkout.c
>> @@ -471,6 +475,88 @@ static void setup_branch_path(struct branch_info *branch)
>> +static int needs_working_tree_merge(const struct checkout_opts *opts,
>> +       const struct branch_info *old_branch_info,
>> +       const struct branch_info *new_branch_info)
>> +{
>> +       /*
>> +        * We must do the merge if we are actually moving to a new
>> +        * commit tree.
>> +        */
>> +       if (!old_branch_info->commit || !new_branch_info->commit ||
>> +               oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
>> +               return 1;
>> +       [...]
>> +       return 0;
>> +}
> 
> This long list of special-case checks doesn't leave me too enthused,
> however, that aside, this approach seems backward. Rather than erring
> on the side of safety by falling back to the merging behavior, it errs
> in the other direction, which may be a problem if this list of
> special-case checks ever gets out of sync with 'checkout_opts'. That
> is, if someone adds a new option which ought to employ the merging
> behavior, but forgets to update this function, then this function will
> incorrectly default to using the optimization.
> 

I'm not thrilled with the long list either (the plethora of comments 
probably makes it appear worse than it is) but I don't see how flipping 
the logic around makes it fail if someone adds a new option.  The "meets 
all criteria for optimization" code can only test existing options.

> A safer approach would be the inverse, namely:
> 
>      static int skip_worktree_merge(...)
>      {
>          if (...meets all criteria for optimization...)
>              return 1;
>          return 0;
>      }
> 
>>   static int merge_working_tree(const struct checkout_opts *opts,
>>                                struct branch_info *old_branch_info,
>>                                struct branch_info *new_branch_info,
>> {
>> +       /*
>> +        * Skip merging the trees, updating the index, and work tree only if we
>> +        * are simply creating a new branch via "git checkout -b foo."  Any
>> +        * other options or usage will continue to do all these steps.
>> +        */
>> +       if (core_optimize_checkout && !needs_working_tree_merge(opts, old_branch_info, new_branch_info))
>> +               return 0;
> 
> This seems a somewhat odd place to hook in this optimization,
> especially as there is only a single caller of this function. Instead,
> one might expect the caller itself to make this judgment and avoid
> trying the merge in the first place if not needed. That is, in
> switch_branches:
> 
>      if (!skip_worktree_merge(...))
>          ret = merge_working_tree(...);
> 

I personally agree, it was moved to its current location per feedback 
the first time around.  Perhaps with the addition of the config setting 
it will be better received moved out to the caller.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-24 19:21 ` Junio C Hamano
@ 2018-07-24 20:47   ` Ben Peart
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Peart @ 2018-07-24 20:47 UTC (permalink / raw)
  To: Junio C Hamano, Ben Peart; +Cc: git@vger.kernel.org



On 7/24/2018 3:21 PM, Junio C Hamano wrote:
> Ben Peart <Ben.Peart@microsoft.com> writes:
> 
>> From: Ben Peart <Ben.Peart@microsoft.com>
>>
>> If the new core.optimizecheckout config setting is set to true, speed up
>> "git checkout -b foo" by avoiding the work to merge the working tree.  This
>> is valid because no merge needs to occur - only creating the new branch/
>> updating the refs. Any other options force it through the old code path.
>>
>> This change in behavior is off by default and behind the config setting so
>> that users have to opt-in to the optimized behavior.
> 
> 
> 
> 
>> We've been running with this patch internally for a long time but it was
>> rejected when I submitted it to the mailing list before because it
>> implicitly changes the behavior of checkout -b. Trying it again configured
>> behind a config setting as a potential solution for other optimizations to
>> checkout that could change the behavior as well.
>>
>> https://public-inbox.org/git/20180724042740.GB13248@sigill.intra.peff.net/T/#m75afe3ab318d23f36334cf3a6e3d058839592469
> 
> An incorrect link?  It does not look like a thread that explains
> what was previously submitted but failed.  The last paragraph looks
> like a fine material below the three-dash line.
> 

See my earlier reply about this section in:

https://public-inbox.org/git/xmqqh8koxwwi.fsf@gitster-ct.c.googlers.com/T/#mb31136a09dbc1a963a5a62e840b118ac33043edf

> 
>> Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
>> ---
>>
>> Notes:
>>      Base Ref: master
>>      Web-Diff: https://github.com/benpeart/git/commit/f43d934ce7
>>      Checkout: git fetch https://github.com/benpeart/git checkout-b-v1 && git checkout f43d934ce7
>>
>>   Documentation/config.txt |  6 +++
>>   builtin/checkout.c       | 94 ++++++++++++++++++++++++++++++++++++++++
>>   cache.h                  |  1 +
>>   config.c                 |  5 +++
>>   environment.c            |  1 +
>>   5 files changed, 107 insertions(+)
>>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index a32172a43c..2c4f513bf1 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -911,6 +911,12 @@ core.commitGraph::
>>   	Enable git commit graph feature. Allows reading from the
>>   	commit-graph file.
>>   
>> +core.optimizedCheckout
>> +	Speed up "git checkout -b foo" by skipping much of the work of a
>> +	full checkout command.  This changs the behavior as it will skip
>> +	merging the trees and updating the index and instead only create
>> +	and switch to the new ref.
> 
> By the way, why is it a core.* thing, not checkout.* thing?
> 

I followed the naming convention used by core.sparsecheckout but I'm 
happy to call it whatever people want.

> If a new feature is not necessarily recommendable for normal users
> and it needs to be hidden behind an opt-in knob (I do not have a
> strong opinion if that is or is not the case for this particular
> feature at this point), the documentation for the knob should give a
> bit more than "This chang(e)s the behavior" to the readers, I would
> think, to be intellectually honest ;-).  Let's tell them what bad
> things happen if we pretend that we switched the branch without
> twoway merge and the index update to help them make an informed
> decision.
> 

I attempted to explain what the change in behavior was in the same 
sentence by saying what it skips and what it still does.  If that isn't 
intellectually honest, help me know how to better explain it so that it is.

Is this better?

Speed up "git checkout -b <new_branch>" by skipping the twoway merge and 
index update.  Instead, just create a new branch named <new_branch> and 
switch to it.  The working directory and index are left unchanged.

>> +static int needs_working_tree_merge(const struct checkout_opts *opts,
>> +	const struct branch_info *old_branch_info,
>> +	const struct branch_info *new_branch_info)
>> +{
>> +	/*
>> +	 * We must do the merge if we are actually moving to a new
>> +	 * commit tree.
> 
> What's a "commit tree"?  Shouldn't it be just a "commit"?
> 

Sure

>> +	 */
>> +	if (!old_branch_info->commit || !new_branch_info->commit ||
>> +		oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
>> +		return 1;
>> +
>> +	/*
>> +	 * opts->patch_mode cannot be used with switching branches so is
>> +	 * not tested here
>> +	 */
>> +
>> +	/*
>> +	 * opts->quiet only impacts output so doesn't require a merge
>> +	 */
>> +
>> +	/*
>> +	 * Honor the explicit request for a three-way merge or to throw away
>> +	 * local changes
>> +	 */
>> +	if (opts->merge || opts->force)
>> +		return 1;
>> +
>> +	/*
>> +	 * --detach is documented as "updating the index and the files in the
>> +	 * working tree" but this optimization skips those steps so fall through
>> +	 * to the regular code path.
>> +	 */
>> +	if (opts->force_detach)
>> +		return 1;
>> +
>> +	/*
>> +	 * opts->writeout_stage cannot be used with switching branches so is
>> +	 * not tested here
>> +	 */
>> +
>> +	/*
>> +	 * Honor the explicit ignore requests
>> +	 */
>> +	if (!opts->overwrite_ignore || opts->ignore_skipworktree ||
>> +		opts->ignore_other_worktrees)
>> +		return 1;
>> +
>> +	/*
>> +	 * opts->show_progress only impacts output so doesn't require a merge
>> +	 */
>> +
>> +	/*
>> +	 * If we aren't creating a new branch any changes or updates will
>> +	 * happen in the existing branch.  Since that could only be updating
>> +	 * the index and working directory, we don't want to skip those steps
>> +	 * or we've defeated any purpose in running the command.
>> +	 */
>> +	if (!opts->new_branch)
>> +		return 1;
>> +
>> +	/*
>> +	 * new_branch_force is defined to "create/reset and checkout a branch"
>> +	 * so needs to go through the merge to do the reset
>> +	 */
>> +	if (opts->new_branch_force)
>> +		return 1;
>> +
>> +	/*
>> +	 * A new orphaned branch requrires the index and the working tree to be
>> +	 * adjusted to <start_point>
>> +	 */
>> +	if (opts->new_orphan_branch)
>> +		return 1;
>> +
>> +	/*
>> +	 * Remaining variables are not checkout options but used to track state
>> +	 */
>> +
>> +	return 0;
>> +}
> 
> This helper function alone looks like we are creating a maintenance
> nightmare from a quick scan.  How are we going to keep this up to
> date?
> 
> I offhand do not know how "git checkout -b foo" would behave
> differently if we do not do a two-way merge between HEAD and HEAD to
> update the index.  We'd still need to list the local modifications
> and say "Switched to a new branch 'foo'", but that would be a minor
> thing compared to the two-way merge machinery.
> 

With the patch, the index and working directory aren't modified, it does 
still list the local modifications and says "Switched to..."

The changes in behavior come from other inputs.  For example, if you 
turned on sparse checkout and then did a "git checkout -b foo" the old 
code path would update the skip-worktree bit in the index and remove all 
the files from the working directory that were no longer specified in 
the sparse-checkout file.  The new behavior would not do that as it no 
longer updates the index or working directory.

The checkout documentation doesn't _say_ that "git checkout -b foo" will 
also update the working directory after changing the sparse-checkout 
settings but it happens to do that.  Is that a common scenario? No - but 
it is an undocumented behavior that will change if they opt in to this 
new behavior.  At least the documentation for the new flag _does_ let 
the user know the working directory and index won't be changed so if 
they opt in, they shouldn't be too surprised.

> Was the primary reason why the patch "changes the behaviour" because
> nobody could prove that needs_working_tree_merge() helper reliably
> detects that "checkout -b foo" case and that case alone, and show a
> way to make sure it will keep doing so in the future when other new
> features are added to the command?
> 

My concern isn't ensuring that the patch reliably detects "checkout -b", 
the challenge is ensuring it will keep doing so as features are added 
that impact the "-b" option.  The problem of adding a new option and 
having to ensure it behaves properly with all the other options isn't 
new but I agree this does add one more case that has to be handled.

>> @@ -479,6 +565,14 @@ static int merge_working_tree(const struct checkout_opts *opts,
>>   	int ret;
>>   	struct lock_file lock_file = LOCK_INIT;
>>   
>> +	/*
>> +	 * Skip merging the trees, updating the index, and work tree only if we
>> +	 * are simply creating a new branch via "git checkout -b foo."  Any
>> +	 * other options or usage will continue to do all these steps.
>> +	 */
>> +	if (core_optimize_checkout && !needs_working_tree_merge(opts, old_branch_info, new_branch_info))
>> +		return 0;
>> +
>>   	hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
>>   	if (read_cache_preload(NULL) < 0)
>>   		return error(_("index file corrupt"));
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-24 19:45   ` Ben Peart
@ 2018-07-26 15:04     ` Junio C Hamano
  2018-07-26 18:59       ` Eric Sunshine
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2018-07-26 15:04 UTC (permalink / raw)
  To: Ben Peart; +Cc: Eric Sunshine, Ben Peart, Git List

Ben Peart <peartben@gmail.com> writes:

[jc: it was a bit surprising that Eric covered all the bits I
covered while we were writing without peeking each other's ;-)]

>> This long list of special-case checks doesn't leave me too enthused,
>> however, that aside, this approach seems backward. Rather than erring
>> on the side of safety by falling back to the merging behavior, it errs
>> in the other direction, ...
>
> I'm not thrilled with the long list either (the plethora of comments
> probably makes it appear worse than it is) but I don't see how...

If there were a simple and futureproof way to tell the option
parsing loop to notice any feature other than "-b newbranch" was
used, then such a whitelisting may be a viable way, but there is
not, and the whitelist condition can become (over time---we are
talking about futureproofing and not "a filter that happens to match
today's feature set") just as complex as this blacklisting function
is (e.g. feature A and feature B when used alone may be compatible
with the optimization but not when used both), so if we were to use
this optimization, I think this long list of special-case checks is
the best we could do.

>>      if (!skip_worktree_merge(...))
>>          ret = merge_working_tree(...);
>>
>
> I personally agree, it was moved to its current location per feedback
> the first time around.  Perhaps with the addition of the config
> setting it will be better received moved out to the caller.

Sounds sensible.  I am still not enthused by the configuration
variable that is underdocumented.  You already said that we know
that this optimization does a wrong thing when sparse checkout is
used---any other cases?  I do not think of any myself, and if that
is true, I am wondering if it makes more sense to "do we have sparse
configuration?" as part of the blacklist condition.  That way, the
users do not have to set anything and they will get an optimization
benefit from an obvious change to skip "read-tree -m HEAD HEAD" that
ought to be a no-op.

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-26 15:04     ` Junio C Hamano
@ 2018-07-26 18:59       ` Eric Sunshine
  2018-07-26 19:08         ` Eric Sunshine
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Sunshine @ 2018-07-26 18:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ben Peart, Ben Peart, Git List

On Thu, Jul 26, 2018 at 11:04 AM Junio C Hamano <gitster@pobox.com> wrote:
> Ben Peart <peartben@gmail.com> writes:
> > I'm not thrilled with the long list either (the plethora of comments
> > probably makes it appear worse than it is) but I don't see how...
>
> If there were a simple and futureproof way to tell the option
> parsing loop to notice any feature other than "-b newbranch" was
> used, then such a whitelisting may be a viable way, but there is
> not, and the whitelist condition can become (over time---we are
> talking about futureproofing and not "a filter that happens to match
> today's feature set") just as complex as this blacklisting function
> is (e.g. feature A and feature B when used alone may be compatible
> with the optimization but not when used both), so if we were to use
> this optimization, I think this long list of special-case checks is
> the best we could do.

I'm wondering if a two-stage parse-options invocations could make this
potential maintenance problem more manageable by altogether
eliminating needs_working_tree_merge(). Something very roughly along
the lines of:

    new_branch_and_passive_options = {
        OPT_STRING('b', NULL, ...),
        ...options which can't impact "optimization" decision...
    };
    argc = parse_options(argc, argv, prefix,
        new_branch_and_passive_options, NULL,
        PARSE_OPT_KEEP_UNKNOWN | PARSE_OPT_KEEP_DASHDASH);

    can_optimize_new_branch = 1;
    for (i = 0; i < argc; i++)
        if (argv[i][0] == '-') {
            can_optimize_new_branch = 0;
            break;
        }

    options = {
        ...all other options...
    }
    argc = parse_options(argc, argv, prefix, options,
        checkout_usage, PARSE_OPT_KEEP_DASHDASH);

    ...as before...

The can_optimize_new_branch check could, of course, be fooled by a
non-option which starts with a "-", but that would err toward safety
of not optimizing, so shouldn't be a worry.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1] checkout: optionally speed up "git checkout -b foo"
  2018-07-26 18:59       ` Eric Sunshine
@ 2018-07-26 19:08         ` Eric Sunshine
  0 siblings, 0 replies; 32+ messages in thread
From: Eric Sunshine @ 2018-07-26 19:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ben Peart, Ben Peart, Git List

On Thu, Jul 26, 2018 at 2:59 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Thu, Jul 26, 2018 at 11:04 AM Junio C Hamano <gitster@pobox.com> wrote:
> > If there were a simple and futureproof way to tell the option
> > parsing loop to notice any feature other than "-b newbranch" was
> > used, then such a whitelisting may be a viable way [...]
>
> I'm wondering if a two-stage parse-options invocations could make this
> potential maintenance problem more manageable by altogether
> eliminating needs_working_tree_merge().

A downside of this approach is that it too becomes a nightmare if
git-checkout grows additional special cases like the proposed "-b", in
which case multi-stage or disjoint or intersecting parse-options
invocations might arise. Another downside is that this parse-options
idea is somewhat invasive, whereas, needs_working_tree_merge(),
despite its ugliness, is at least is self-contained and not at all
invasive.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-07-24 18:01 [PATCH v1] checkout: optionally speed up "git checkout -b foo" Ben Peart
  2018-07-24 18:42 ` Eric Sunshine
  2018-07-24 19:21 ` Junio C Hamano
@ 2018-07-31 16:39 ` Ben Peart
  2018-07-31 20:01   ` Junio C Hamano
  2018-08-01 15:10   ` Duy Nguyen
  2018-08-16 18:27 ` [PATCH v3] " Ben Peart
  3 siblings, 2 replies; 32+ messages in thread
From: Ben Peart @ 2018-07-31 16:39 UTC (permalink / raw)
  To: Ben Peart
  Cc: git@vger.kernel.org, gitster@pobox.com, sunshine@sunshineco.com,
	Ben Peart

From: Ben Peart <Ben.Peart@microsoft.com>

Skip merging the commit, updating the index and working directory if and
only if we are creating a new branch via "git checkout -b <new_branch>."
Any other checkout options will still go through the former code path. 

If sparse_checkout is on, require the user to manually opt in to this
optimzed behavior by setting the config setting checkout.optimizeNewBranch
to true as we will no longer update the skip-worktree bit in the index, nor
add/remove files in the working directory to reflect the current sparse
checkout settings.

Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
---

The biggest change in this version was suggested in feedback to the last
patch.  I have turned on the optimzation by default if sparse-checkout is
not on so that most users do not have to set anything and they will get the
benefit of the optimization.

Because users that use sparse checkout are probably doing so because they
have a large working directory, they stand to benefit the most from this
optimization.  To enable them to benefit, I added a "checkout.optimizeNewBranch"
config setting that allows them to opt-in to this optimization if they are
willing to accept the change in behavior.  I updated the documentation which
should make it clear exactly what the change in behavior is.

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/5ea167fe90
    Checkout: git fetch https://github.com/benpeart/git checkout-b-v2 && git checkout 5ea167fe90
    
    ### Patches

 Documentation/config.txt         |   5 ++
 builtin/checkout.c               | 113 +++++++++++++++++++++++++++++--
 t/t1090-sparse-checkout-scope.sh |  14 ++++
 3 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 43b2de7b5f..acf81143d4 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1101,6 +1101,11 @@ browser.<tool>.path::
 	browse HTML help (see `-w` option in linkgit:git-help[1]) or a
 	working repository in gitweb (see linkgit:git-instaweb[1]).
 
+checkout.optimizeNewBranch
+	When set to true, "git checkout -b <new_branch>" will not update the
+	skip-worktree bit in the index nor add/remove files in the working
+	directory to reflect the current sparse checkout settings.
+
 clean.requireForce::
 	A boolean to make git-clean do nothing unless given -f,
 	-i or -n.   Defaults to true.
diff --git a/builtin/checkout.c b/builtin/checkout.c
index 28627650cd..991b71a341 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -24,6 +24,8 @@
 #include "submodule-config.h"
 #include "submodule.h"
 
+static int checkout_optimize_new_branch;
+
 static const char * const checkout_usage[] = {
 	N_("git checkout [<options>] <branch>"),
 	N_("git checkout [<options>] [<branch>] -- <file>..."),
@@ -41,6 +43,10 @@ struct checkout_opts {
 	int ignore_skipworktree;
 	int ignore_other_worktrees;
 	int show_progress;
+	/*
+	 * If new checkout options are added, skip_merge_working_tree
+	 * should be updated accordingly.
+	 */
 
 	const char *new_branch;
 	const char *new_branch_force;
@@ -471,6 +477,98 @@ static void setup_branch_path(struct branch_info *branch)
 	branch->path = strbuf_detach(&buf, NULL);
 }
 
+/*
+ * Skip merging the trees, updating the index and working directory if and
+ * only if we are creating a new branch via "git checkout -b <new_branch>."
+ */
+static int skip_merge_working_tree(const struct checkout_opts *opts,
+	const struct branch_info *old_branch_info,
+	const struct branch_info *new_branch_info)
+{
+	/*
+	 * Do the merge if sparse checkout is on and the user has not opted in
+	 * to the optimized behavior
+	 */
+	if (core_apply_sparse_checkout && !checkout_optimize_new_branch)
+		return 0;
+
+	/*
+	 * We must do the merge if we are actually moving to a new commit.
+	 */
+	if (!old_branch_info->commit || !new_branch_info->commit ||
+		oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
+		return 0;
+
+	/*
+	 * opts->patch_mode cannot be used with switching branches so is
+	 * not tested here
+	 */
+
+	/*
+	 * opts->quiet only impacts output so doesn't require a merge
+	 */
+
+	/*
+	 * Honor the explicit request for a three-way merge or to throw away
+	 * local changes
+	 */
+	if (opts->merge || opts->force)
+		return 0;
+
+	/*
+	 * --detach is documented as "updating the index and the files in the
+	 * working tree" but this optimization skips those steps so fall through
+	 * to the regular code path.
+	 */
+	if (opts->force_detach)
+		return 0;
+
+	/*
+	 * opts->writeout_stage cannot be used with switching branches so is
+	 * not tested here
+	 */
+
+	/*
+	 * Honor the explicit ignore requests
+	 */
+	if (!opts->overwrite_ignore || opts->ignore_skipworktree ||
+		opts->ignore_other_worktrees)
+		return 0;
+
+	/*
+	 * opts->show_progress only impacts output so doesn't require a merge
+	 */
+
+	/*
+	 * If we aren't creating a new branch any changes or updates will
+	 * happen in the existing branch.  Since that could only be updating
+	 * the index and working directory, we don't want to skip those steps
+	 * or we've defeated any purpose in running the command.
+	 */
+	if (!opts->new_branch)
+		return 0;
+
+	/*
+	 * new_branch_force is defined to "create/reset and checkout a branch"
+	 * so needs to go through the merge to do the reset
+	 */
+	if (opts->new_branch_force)
+		return 0;
+
+	/*
+	 * A new orphaned branch requrires the index and the working tree to be
+	 * adjusted to <start_point>
+	 */
+	if (opts->new_orphan_branch)
+		return 0;
+
+	/*
+	 * Remaining variables are not checkout options but used to track state
+	 */
+
+	return 1;
+}
+
 static int merge_working_tree(const struct checkout_opts *opts,
 			      struct branch_info *old_branch_info,
 			      struct branch_info *new_branch_info,
@@ -845,10 +943,12 @@ static int switch_branches(const struct checkout_opts *opts,
 		parse_commit_or_die(new_branch_info->commit);
 	}
 
-	ret = merge_working_tree(opts, &old_branch_info, new_branch_info, &writeout_error);
-	if (ret) {
-		free(path_to_free);
-		return ret;
+	if (!skip_merge_working_tree(opts, &old_branch_info, new_branch_info)) {
+		ret = merge_working_tree(opts, &old_branch_info, new_branch_info, &writeout_error);
+		if (ret) {
+			free(path_to_free);
+			return ret;
+		}
 	}
 
 	if (!opts->quiet && !old_branch_info.path && old_branch_info.commit && new_branch_info->commit != old_branch_info.commit)
@@ -863,6 +963,11 @@ static int switch_branches(const struct checkout_opts *opts,
 
 static int git_checkout_config(const char *var, const char *value, void *cb)
 {
+	if (!strcmp(var, "checkout.optimizenewbranch")) {
+		checkout_optimize_new_branch = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "diff.ignoresubmodules")) {
 		struct checkout_opts *opts = cb;
 		handle_ignore_submodules_arg(&opts->diff_options, value);
diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh
index 1f61eb3e88..25d7c700f6 100755
--- a/t/t1090-sparse-checkout-scope.sh
+++ b/t/t1090-sparse-checkout-scope.sh
@@ -31,6 +31,20 @@ test_expect_success 'perform sparse checkout of master' '
 	test_path_is_file c
 '
 
+test_expect_success 'checkout -b checkout.optimizeNewBranch interaction' '
+	cp .git/info/sparse-checkout .git/info/sparse-checkout.bak &&
+	test_when_finished "
+		mv -f .git/info/sparse-checkout.bak .git/info/sparse-checkout
+		git checkout master
+	" &&
+	echo "/b" >>.git/info/sparse-checkout &&
+	test "$(git ls-files -t b)" = "S b" &&
+	git -c checkout.optimizeNewBranch=true checkout -b fast &&
+	test "$(git ls-files -t b)" = "S b" &&
+	git checkout -b slow &&
+	test "$(git ls-files -t b)" = "H b"
+'
+
 test_expect_success 'merge feature branch into sparse checkout of master' '
 	git merge feature &&
 	test_path_is_file a &&

base-commit: ffc6fa0e396238de3a30623912980263b4f283ab
-- 
2.17.0.gvfs.1.123.g449c066


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-07-31 16:39 ` [PATCH v2] checkout: optimize "git checkout -b <new_branch>" Ben Peart
@ 2018-07-31 20:01   ` Junio C Hamano
  2018-08-01 15:10   ` Duy Nguyen
  1 sibling, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2018-07-31 20:01 UTC (permalink / raw)
  To: Ben Peart; +Cc: git@vger.kernel.org, sunshine@sunshineco.com

Ben Peart <Ben.Peart@microsoft.com> writes:

> The biggest change in this version was suggested in feedback to the last
> patch.  I have turned on the optimzation by default if sparse-checkout is
> not on so that most users do not have to set anything and they will get the
> benefit of the optimization.

Sounds like a good thing to do.

If we missed something in the logic of skip_merge_working_tree(),
the breakage may affect more unsuspecting people, which may or may
not be a bad thing---at least it would allow us to notice it sooner.

> +	When set to true, "git checkout -b <new_branch>" will not update the
> +	skip-worktree bit in the index nor add/remove files in the working
> +	directory to reflect the current sparse checkout settings.

This reads a lot clearer, at least to me, than the documentation
update in the previous round, by speaking in terms of what the
user-visible side effect of this setting would be.

Nicely done.  Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-07-31 16:39 ` [PATCH v2] checkout: optimize "git checkout -b <new_branch>" Ben Peart
  2018-07-31 20:01   ` Junio C Hamano
@ 2018-08-01 15:10   ` Duy Nguyen
  2018-08-02 18:02     ` Ben Peart
  1 sibling, 1 reply; 32+ messages in thread
From: Duy Nguyen @ 2018-08-01 15:10 UTC (permalink / raw)
  To: Ben Peart; +Cc: Git Mailing List, Junio C Hamano, Eric Sunshine

On Tue, Jul 31, 2018 at 7:03 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>
> From: Ben Peart <Ben.Peart@microsoft.com>
>
> Skip merging the commit, updating the index and working directory if and
> only if we are creating a new branch via "git checkout -b <new_branch>."
> Any other checkout options will still go through the former code path.

I'd like to see this giant list of checks broken down and pushed down
to smaller areas so that chances of new things being added but checks
not updated become much smaller. And ideally there should just be no
behavior change (I think with your change, "checkout -b" will not
report local changes, but it's not mentioned in the config document;
more things like that can easily slip).

So. I assume this reason for this patch is because on super large worktrees

 - 2-way merge is too slow
 - applying spare checkout patterns on a huge worktree is also slow
 - writing index is, again, slow
 - show_local_changes() slow

For 2-way merge, I believe we can detect inside unpack_trees() that
it's a 2-way merge (fn == twoway_merge), from HEAD to HEAD (simple
enough check), then from the 2-way merge table we know for sure
nothing is going to change and we can just skip traverse_trees() call
in unpack_trees().

On the sparse checkout application. This only needs to be done when
there are new files added, or the spare-checkout file has been updated
since the last time it's been used. We can keep track of these things
(sparse-checkout file change could be kept track with just stat info
maybe as an index extension) then we can skip applying sparse checkout
not for this particular case for but general checkouts as well. Spare
checkout file rarely changes. Big win overall.

And if all go according to plan, there will be no changes made in the
index (by either 2-way merge or sparse checkout stuff) we should be
able to just skip writing down the index, if we haven't done that
already.

show_local_changes() should be sped up significantly with the new
cache-tree optimization I'm working on in another thread.

If I have not made any mistake in my analysis so far, we achieve a big
speedup without adding a new config knob and can still fall back to
slower-but-same-behavior when things are not in the right condition. I
know it will not be the same speedup as this patch because when facing
thousands of items, even counting them takes time. But I think it's a
reasonable speedup without making the code base more fragile.
-- 
Duy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-08-01 15:10   ` Duy Nguyen
@ 2018-08-02 18:02     ` Ben Peart
  2018-08-03 15:58       ` Duy Nguyen
  2018-08-05  8:57       ` Duy Nguyen
  0 siblings, 2 replies; 32+ messages in thread
From: Ben Peart @ 2018-08-02 18:02 UTC (permalink / raw)
  To: Duy Nguyen, Ben Peart; +Cc: Git Mailing List, Junio C Hamano, Eric Sunshine



On 8/1/2018 11:10 AM, Duy Nguyen wrote:
> On Tue, Jul 31, 2018 at 7:03 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>>
>> From: Ben Peart <Ben.Peart@microsoft.com>
>>
>> Skip merging the commit, updating the index and working directory if and
>> only if we are creating a new branch via "git checkout -b <new_branch>."
>> Any other checkout options will still go through the former code path.
> 
> I'd like to see this giant list of checks broken down and pushed down
> to smaller areas so that chances of new things being added but checks
> not updated become much smaller. And ideally there should just be no
> behavior change (I think with your change, "checkout -b" will not
> report local changes, but it's not mentioned in the config document;
> more things like that can easily slip).
> 

One trade off of pushing these optimizations down into the lower-level 
functions is that they have a greater potential to break other command 
if our assumptions are wrong.  Changing these low level functions is a 
much more invasive set of patches.

I didn't feel confident enough to pursue this path and instead, decided 
to do the single, high level optimization around the specific scenario. 
While it has its own drawbacks (the nasty set of conditions we're 
testing), the potential for breaking other commands is much smaller.

That said, I'm willing to look into the model of pushing the 
checks/optimizations down to smaller areas if we can 1) ensure we aren't 
breaking other commands and 2) we can get similar performance.

> So. I assume this reason for this patch is because on super large worktrees
> 
>   - 2-way merge is too slow
>   - applying spare checkout patterns on a huge worktree is also slow
>   - writing index is, again, slow
>   - show_local_changes() slow
> 

That is pretty close but here is some real data on a large repo.

"git checkout -b <new_branch>" with this patch takes 0.32 seconds.
"git checkout -b <new_branch>" without this patch takes 14.6 seconds.

Note, all numbers are with a hot disk cache - real world numbers for the 
unpatched case can be much worse as it has to do a lot of disk IO to 
read/write the 267 MB index, load 500K+ tree objects, etc:

Name                                      	Inc %	      Inc
  ||+ git!mainCRTStartup                   	 89.2	   13,380
  || + git!__tmainCRTStartup               	 89.2	   13,380
  ||  + git!cmd_main                       	 89.2	   13,380
  ||   + git!handle_builtin                	 89.2	   13,380
  ||    + git!cmd_checkout                 	 89.2	   13,380
  ||     + git!unpack_trees                	 71.5	   10,725
  ||     |+ git!traverse_trees             	 39.7	    5,956
  ||     |+ git!cache_tree_update          	 16.1	    2,408
  ||     |+ git!??unpack_callback          	 11.0	    1,649
  ||     |+ git!discard_index              	  2.8	      423
  ||     + git!write_locked_index          	  8.4	    1,257
  ||     + git!??cache_tree_invalidate_path	  5.1	      767
  ||     + git!read_index_preload          	  3.4	      514

> For 2-way merge, I believe we can detect inside unpack_trees() that
> it's a 2-way merge (fn == twoway_merge), from HEAD to HEAD (simple
> enough check), then from the 2-way merge table we know for sure
> nothing is going to change and we can just skip traverse_trees() call
> in unpack_trees().
> 

If we can skip the call to traverse_trees(), that will give us the bulk 
of the savings (39.7% + 11% = 50.7% if my math is correct).

> On the sparse checkout application. This only needs to be done when
> there are new files added, or the spare-checkout file has been updated
> since the last time it's been used. We can keep track of these things
> (sparse-checkout file change could be kept track with just stat info
> maybe as an index extension) then we can skip applying sparse checkout
> not for this particular case for but general checkouts as well. Spare
> checkout file rarely changes. Big win overall.
> 

With the current patch, we don't need to load or update the index at 
all.  Without the patch, we've already replaced the standard 
sparse-checkout logic with something significantly faster so in our 
particular case, I think it's safe to skip the additional complexity of 
keeping track of changes to the sparse-checkout file.

> And if all go according to plan, there will be no changes made in the
> index (by either 2-way merge or sparse checkout stuff) we should be
> able to just skip writing down the index, if we haven't done that
> already.
> 

That would be great as writing the index is 8.4% of the time spent.

> show_local_changes() should be sped up significantly with the new
> cache-tree optimization I'm working on in another thread.
> 

As you can see, updating the cache_tree is relatively expensive (16.1% + 
5.1%) so we would definitely benefit from any improvements there.

> If I have not made any mistake in my analysis so far, we achieve a big
> speedup without adding a new config knob and can still fall back to
> slower-but-same-behavior when things are not in the right condition. I
> know it will not be the same speedup as this patch because when facing
> thousands of items, even counting them takes time. But I think it's a
> reasonable speedup without making the code base more fragile.
> 

So my rough math says we can potentially save 50.7% + 8.4% + (x * 21.2%) 
(where x is the percentage savings with an optimized cache_tree).  If we 
assume x == 50%, that means we can save 69.7% of the overall time.

For comparison, that would put "git checkout -b <new_branch>" at:

0.3 seconds - with the current patch
4.4 seconds - with the proposed patch
14.6 seconds - with no patch

Am I missing anything?  Is my math wrong?  Any other ideas for how to 
improve the proposed patch?

Thanks,

Ben

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-08-02 18:02     ` Ben Peart
@ 2018-08-03 15:58       ` Duy Nguyen
  2018-08-06 14:25         ` Ben Peart
  2018-08-05  8:57       ` Duy Nguyen
  1 sibling, 1 reply; 32+ messages in thread
From: Duy Nguyen @ 2018-08-03 15:58 UTC (permalink / raw)
  To: Ben Peart; +Cc: Ben Peart, Git Mailing List, Junio C Hamano, Eric Sunshine

On Thu, Aug 02, 2018 at 02:02:00PM -0400, Ben Peart wrote:
> 
> 
> On 8/1/2018 11:10 AM, Duy Nguyen wrote:
> > On Tue, Jul 31, 2018 at 7:03 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
> >>
> >> From: Ben Peart <Ben.Peart@microsoft.com>
> >>
> >> Skip merging the commit, updating the index and working directory if and
> >> only if we are creating a new branch via "git checkout -b <new_branch>."
> >> Any other checkout options will still go through the former code path.
> > 
> > I'd like to see this giant list of checks broken down and pushed down
> > to smaller areas so that chances of new things being added but checks
> > not updated become much smaller. And ideally there should just be no
> > behavior change (I think with your change, "checkout -b" will not
> > report local changes, but it's not mentioned in the config document;
> > more things like that can easily slip).
> > 
> 
> One trade off of pushing these optimizations down into the lower-level 
> functions is that they have a greater potential to break other command 
> if our assumptions are wrong.  Changing these low level functions is a 
> much more invasive set of patches.
> 
> I didn't feel confident enough to pursue this path and instead, decided 
> to do the single, high level optimization around the specific scenario. 
> While it has its own drawbacks (the nasty set of conditions we're 
> testing), the potential for breaking other commands is much smaller.
> 
> That said, I'm willing to look into the model of pushing the 
> checks/optimizations down to smaller areas if we can 1) ensure we aren't 
> breaking other commands and 2) we can get similar performance.
> 
> > So. I assume this reason for this patch is because on super large worktrees
> > 
> >   - 2-way merge is too slow
> >   - applying spare checkout patterns on a huge worktree is also slow
> >   - writing index is, again, slow
> >   - show_local_changes() slow
> > 
> 
> That is pretty close but here is some real data on a large repo.
> 
> "git checkout -b <new_branch>" with this patch takes 0.32 seconds.
> "git checkout -b <new_branch>" without this patch takes 14.6 seconds.
> 
> Note, all numbers are with a hot disk cache - real world numbers for the 
> unpatched case can be much worse as it has to do a lot of disk IO to 
> read/write the 267 MB index, load 500K+ tree objects, etc:
> 
> Name                                      	Inc %	      Inc
>   ||+ git!mainCRTStartup                   	 89.2	   13,380
>   || + git!__tmainCRTStartup               	 89.2	   13,380
>   ||  + git!cmd_main                       	 89.2	   13,380
>   ||   + git!handle_builtin                	 89.2	   13,380
>   ||    + git!cmd_checkout                 	 89.2	   13,380
>   ||     + git!unpack_trees                	 71.5	   10,725
>   ||     |+ git!traverse_trees             	 39.7	    5,956
>   ||     |+ git!cache_tree_update          	 16.1	    2,408
>   ||     |+ git!??unpack_callback          	 11.0	    1,649
>   ||     |+ git!discard_index              	  2.8	      423
>   ||     + git!write_locked_index          	  8.4	    1,257
>   ||     + git!??cache_tree_invalidate_path	  5.1	      767
>   ||     + git!read_index_preload          	  3.4	      514
> 
> > For 2-way merge, I believe we can detect inside unpack_trees() that
> > it's a 2-way merge (fn == twoway_merge), from HEAD to HEAD (simple
> > enough check), then from the 2-way merge table we know for sure
> > nothing is going to change and we can just skip traverse_trees() call
> > in unpack_trees().
> > 
> 
> If we can skip the call to traverse_trees(), that will give us the bulk 
> of the savings (39.7% + 11% = 50.7% if my math is correct).

That cache_tree_invalidate_path() should belong to unpack_trees() as
well. At least I known unpack_trees() does that (and I'm pretty sure
it's useless to call it in this context), but maybe it's called
elsewhere too.

> 
> > On the sparse checkout application. This only needs to be done when
> > there are new files added, or the spare-checkout file has been updated
> > since the last time it's been used. We can keep track of these things
> > (sparse-checkout file change could be kept track with just stat info
> > maybe as an index extension) then we can skip applying sparse checkout
> > not for this particular case for but general checkouts as well. Spare
> > checkout file rarely changes. Big win overall.
> > 
> 
> With the current patch, we don't need to load or update the index at 
> all.  Without the patch, we've already replaced the standard 
> sparse-checkout logic with something significantly faster so in our 
> particular case, I think it's safe to skip the additional complexity of 
> keeping track of changes to the sparse-checkout file.
> 
> > And if all go according to plan, there will be no changes made in the
> > index (by either 2-way merge or sparse checkout stuff) we should be
> > able to just skip writing down the index, if we haven't done that
> > already.
> > 
> 
> That would be great as writing the index is 8.4% of the time spent.
> 
> > show_local_changes() should be sped up significantly with the new
> > cache-tree optimization I'm working on in another thread.
> > 
> 
> As you can see, updating the cache_tree is relatively expensive (16.1% + 
> 5.1%) so we would definitely benefit from any improvements there.
> 
> > If I have not made any mistake in my analysis so far, we achieve a big
> > speedup without adding a new config knob and can still fall back to
> > slower-but-same-behavior when things are not in the right condition. I
> > know it will not be the same speedup as this patch because when facing
> > thousands of items, even counting them takes time. But I think it's a
> > reasonable speedup without making the code base more fragile.
> > 
> 
> So my rough math says we can potentially save 50.7% + 8.4% + (x * 21.2%) 
> (where x is the percentage savings with an optimized cache_tree).  If we 
> assume x == 50%, that means we can save 69.7% of the overall time.
> 
> For comparison, that would put "git checkout -b <new_branch>" at:
> 
> 0.3 seconds - with the current patch
> 4.4 seconds - with the proposed patch
> 14.6 seconds - with no patch

4.4 seconds is in my opnion already very reasonable compared to the
no-patch number, and other more often used commands, which I assume
take seconds as well. Once common commands take a lot less time, we
could have another look at this "checkout -b".

These numbers btw should be in the commit message if you send another
optimization patch because it's very unconvincing to say "optimize"
without numbers. My initial reaction was, the majority of open source
projects are of moderate size and "checkout" takes subseconds anyway,
what is the point of optimization (even if you could reduce 90% of
time) since it will not be noticeable anyway?

> Am I missing anything?  Is my math wrong?  Any other ideas for how to 
> improve the proposed patch?

But if you still want to push it further, this is something I have in
mind. It probably has bugs, but at least preliminary test shows me
that it could skip 99% work inside unpack_trees() and not need to
write the index.

The main check to determine "checkout -b" is basically the new
oidcmp() in merge_working_tree(). Not sure if I miss any condition in
that check, I didn't look very closely at checkout code.

-- 8< --
diff --git a/builtin/checkout.c b/builtin/checkout.c
index 28627650cd..912e565acc 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -478,6 +478,7 @@ static int merge_working_tree(const struct checkout_opts *opts,
 {
 	int ret;
 	struct lock_file lock_file = LOCK_INIT;
+	int skip_cache_tree_update = 0;
 
 	hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
 	if (read_cache_preload(NULL) < 0)
@@ -493,6 +494,7 @@ static int merge_working_tree(const struct checkout_opts *opts,
 		struct tree_desc trees[2];
 		struct tree *tree;
 		struct unpack_trees_options topts;
+		int nr_trees;
 
 		memset(&topts, 0, sizeof(topts));
 		topts.head_idx = -1;
@@ -514,20 +516,45 @@ static int merge_working_tree(const struct checkout_opts *opts,
 		topts.merge = 1;
 		topts.gently = opts->merge && old_branch_info->commit;
 		topts.verbose_update = opts->show_progress;
-		topts.fn = twoway_merge;
+		/*
+		 * We repair cache-tree explicitly in this function,
+		 * no point in doing it again inside unpack_trees()
+		 */
+		topts.skip_cache_tree_update = 1;
 		if (opts->overwrite_ignore) {
 			topts.dir = xcalloc(1, sizeof(*topts.dir));
 			topts.dir->flags |= DIR_SHOW_IGNORED;
 			setup_standard_excludes(topts.dir);
 		}
-		tree = parse_tree_indirect(old_branch_info->commit ?
-					   &old_branch_info->commit->object.oid :
-					   the_hash_algo->empty_tree);
-		init_tree_desc(&trees[0], tree->buffer, tree->size);
-		tree = parse_tree_indirect(&new_branch_info->commit->object.oid);
-		init_tree_desc(&trees[1], tree->buffer, tree->size);
-
-		ret = unpack_trees(2, trees, &topts);
+		if (!oidcmp(&old_branch_info->commit->object.oid,
+			    &new_branch_info->commit->object.oid)) {
+			/*
+			 * Skip 2-way merge because we know the end
+			 * result would be the same anyway (to double
+			 * check).
+			 */
+			topts.fn = noop_merge;
+			nr_trees = 0;
+			/*
+			 * Repairing cache-tree is an optional optimization
+			 * and could be safely skipped. Maybe have some
+			 * heuristics based on the_index->cache_nr to
+			 * determine if updating cache-tree could be too
+			 * costly (including index writing cost) and skip it
+			 * instead.
+			 */
+			skip_cache_tree_update = 1;
+		} else {
+			topts.fn = twoway_merge;
+			tree = parse_tree_indirect(old_branch_info->commit ?
+						   &old_branch_info->commit->object.oid :
+						   the_hash_algo->empty_tree);
+			init_tree_desc(&trees[0], tree->buffer, tree->size);
+			tree = parse_tree_indirect(&new_branch_info->commit->object.oid);
+			init_tree_desc(&trees[1], tree->buffer, tree->size);
+			nr_trees = 2;
+		}
+		ret = unpack_trees(nr_trees, trees, &topts);
 		clear_unpack_trees_porcelain(&topts);
 		if (ret == -1) {
 			/*
@@ -597,13 +624,15 @@ static int merge_working_tree(const struct checkout_opts *opts,
 		}
 	}
 
-	if (!active_cache_tree)
-		active_cache_tree = cache_tree();
+	if (!skip_cache_tree_update) {
+		if (!active_cache_tree)
+			active_cache_tree = cache_tree();
 
-	if (!cache_tree_fully_valid(active_cache_tree))
-		cache_tree_update(&the_index, WRITE_TREE_SILENT | WRITE_TREE_REPAIR);
+		if (!cache_tree_fully_valid(active_cache_tree))
+			cache_tree_update(&the_index, WRITE_TREE_SILENT | WRITE_TREE_REPAIR);
+	}
 
-	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK))
+	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK | SKIP_IF_UNCHANGED))
 		die(_("unable to write new index file"));
 
 	if (!opts->force && !opts->quiet)
diff --git a/unpack-trees.c b/unpack-trees.c
index cd0680f11e..007d6662f2 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1275,6 +1275,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct exclude_list el;
+	int result_is_actually_src = 0;
 
 	if (len > MAX_UNPACK_TREES)
 		die("unpack_trees takes at most %d trees", MAX_UNPACK_TREES);
@@ -1353,6 +1354,10 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 
 		if (traverse_trees(len, t, &info) < 0)
 			goto return_failed;
+	} else if (o->fn == noop_merge) {
+		o->result = *o->src_index;
+		o->cache_bottom = o->src_index->cache_nr;
+		result_is_actually_src = 1;
 	}
 
 	/* Any left-over entries in the index? */
@@ -1426,7 +1431,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 
 	ret = check_updates(o) ? (-2) : 0;
 	if (o->dst_index) {
-		if (!ret) {
+		if (!ret &&!o->skip_cache_tree_update) {
 			if (!o->result.cache_tree)
 				o->result.cache_tree = cache_tree();
 			if (!cache_tree_fully_valid(o->result.cache_tree))
@@ -1435,7 +1440,10 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 						  WRITE_TREE_REPAIR);
 		}
 		move_index_extensions(&o->result, o->src_index);
-		discard_index(o->dst_index);
+		if (result_is_actually_src && o->dst_index == o->src_index)
+			; /* no discarding */
+		else
+			discard_index(o->dst_index);
 		*o->dst_index = o->result;
 	} else {
 		discard_index(&o->result);
@@ -2193,3 +2201,9 @@ int oneway_merge(const struct cache_entry * const *src,
 	}
 	return merged_entry(a, old, o);
 }
+
+int noop_merge(const struct cache_entry * const *src,
+	       struct unpack_trees_options *o)
+{
+	BUG("This is basically a flag and not supposed to be called");
+}
diff --git a/unpack-trees.h b/unpack-trees.h
index c2b434c606..3e5bce19a1 100644
--- a/unpack-trees.h
+++ b/unpack-trees.h
@@ -52,6 +52,7 @@ struct unpack_trees_options {
 		     diff_index_cached,
 		     debug_unpack,
 		     skip_sparse_checkout,
+		     skip_cache_tree_update,
 		     gently,
 		     exiting_early,
 		     show_all_errors,
@@ -96,5 +97,7 @@ int bind_merge(const struct cache_entry * const *src,
 	       struct unpack_trees_options *o);
 int oneway_merge(const struct cache_entry * const *src,
 		 struct unpack_trees_options *o);
+int noop_merge(const struct cache_entry * const *src,
+	       struct unpack_trees_options *o);
 
 #endif
-- 8< --

--
Duy

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-08-02 18:02     ` Ben Peart
  2018-08-03 15:58       ` Duy Nguyen
@ 2018-08-05  8:57       ` Duy Nguyen
  1 sibling, 0 replies; 32+ messages in thread
From: Duy Nguyen @ 2018-08-05  8:57 UTC (permalink / raw)
  To: Ben Peart; +Cc: Ben Peart, Git Mailing List, Junio C Hamano, Eric Sunshine

On Thu, Aug 02, 2018 at 02:02:00PM -0400, Ben Peart wrote:
> > And if all go according to plan, there will be no changes made in the
> > index (by either 2-way merge or sparse checkout stuff) we should be
> > able to just skip writing down the index, if we haven't done that
> > already.
> > 
> 
> That would be great as writing the index is 8.4% of the time spent.

And that is of course not good. Avoiding writing index at all is nice
and all but we should not take that much time writing it (what about
other commands?)

I have another idea to reduce index size, which means cheaper trailer
hash calculation, but this may also be a direction you could look
into.

I notice that we have a lot of duplicate data in the index. For
example, most files would have the same ctime and even mtime (at least
the seconds part). group id, device id and file mode should also be
the same in normal case. I guess on Windows, inode is always the same
as well. So why store (and hash) them repeatedly?

This patch writes "index version 5" that only writes those fields out
if they are different from the previous entry (we already do this for
path name in v4).

On webkit repo (275k files), writing time can be reduced from 0.314s
(v2) to 0.206s (v5). File size of all versions:

 9.8M Aug  5 10:27 .git/index.gz
  38M Aug  5 09:54 .git/index-v2
  22M Aug  5 10:26 .git/index-v4
  15M Aug  5 10:34 .git/index-v5

As you can see there's still room to compress, but I don't think we
want to go full deflate mode because compression time goes up.

Another option I think you could consider is not storing full hash. As
long as the shortened hash is still unique, we can expand to full hash
at read time. Of course you can't go too short (I'm thinking half the
current hash size is reasonable trade off) and need to take into
account how much time will be lost for looking up and expanding these
short hashes at read time.

-- 8< --
diff --git a/cache.h b/cache.h
index e6f7ee4b64..1d6bc7d122 100644
--- a/cache.h
+++ b/cache.h
@@ -129,7 +129,7 @@ struct cache_header {
 };
 
 #define INDEX_FORMAT_LB 2
-#define INDEX_FORMAT_UB 4
+#define INDEX_FORMAT_UB 5
 
 /*
  * The "cache_time" is just the low 32 bits of the
diff --git a/read-cache.c b/read-cache.c
index b0b5df5de7..9bb095d6b2 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2002,7 +2002,7 @@ static int ce_write_flush(git_hash_ctx *context, int fd)
 	return 0;
 }
 
-static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
+static int ce_write(git_hash_ctx *context, int fd, const void *data, unsigned int len)
 {
 	while (len) {
 		unsigned int buffered = write_buffer_len;
@@ -2024,6 +2024,18 @@ static int ce_write(git_hash_ctx *context, int fd, void *data, unsigned int len)
 	return 0;
 }
 
+static int ce_write_u32(git_hash_ctx *context, int fd, uint32_t data)
+{
+	data = htonl(data);
+	return ce_write(context, fd, &data, sizeof(data));
+}
+
+static int ce_write_u16(git_hash_ctx *context, int fd, uint16_t data)
+{
+	data = htons(data);
+	return ce_write(context, fd, &data, sizeof(data));
+}
+
 static int write_index_ext_header(git_hash_ctx *context, int fd,
 				  unsigned int ext, unsigned int sz)
 {
@@ -2133,8 +2145,46 @@ static void copy_cache_entry_to_ondisk(struct ondisk_cache_entry *ondisk,
 	}
 }
 
+#define ONDISK_MASK (CE_STAGEMASK | CE_INTENT_TO_ADD | CE_SKIP_WORKTREE)
+
+static int ce_write_entry_v5(git_hash_ctx *c, int fd,
+			     const struct cache_entry *ce,
+			     const struct cache_entry *pce)
+{
+	uint8_t mask = 0;
+	const struct stat_data *st1 = &ce->ce_stat_data;
+	const struct stat_data *st2 = &pce->ce_stat_data;
+	//int ret = 0;
+
+	mask |= (st1->sd_ctime.sec != st2->sd_ctime.sec) << 0;
+	mask |= (st1->sd_mtime.sec != st2->sd_mtime.sec) << 1;
+	mask |= (st1->sd_dev != st2->sd_dev) << 2;
+	mask |= (st1->sd_ino != st2->sd_ino) << 3;
+	mask |= (ce->ce_mode != pce->ce_mode) << 4;
+	mask |= (st1->sd_uid != st2->sd_uid) << 5;
+	mask |= (st1->sd_gid != st2->sd_gid) << 6;
+	mask |= ((ce->ce_flags & ONDISK_MASK) != (pce->ce_flags & ONDISK_MASK)) << 7;
+
+	ce_write(c, fd, &mask, 1);
+	ce_write(c, fd, ce->oid.hash, GIT_SHA1_RAWSZ);
+	ce_write_u32(c, fd, st1->sd_ctime.nsec);
+	ce_write_u32(c, fd, st1->sd_mtime.nsec);
+	ce_write_u32(c, fd, st1->sd_size);
+	if (mask & (1 << 0)) ce_write_u32(c, fd, st1->sd_ctime.sec);
+	if (mask & (1 << 1)) ce_write_u32(c, fd, st1->sd_mtime.sec);
+	if (mask & (1 << 2)) ce_write_u32(c, fd, st1->sd_dev);
+	if (mask & (1 << 3)) ce_write_u32(c, fd, st1->sd_ino);
+	if (mask & (1 << 4)) ce_write_u16(c, fd, ce->ce_mode);
+	if (mask & (1 << 5)) ce_write_u32(c, fd, st1->sd_uid);
+	if (mask & (1 << 6)) ce_write_u32(c, fd, st1->sd_gid);
+	if (mask & (1 << 7)) ce_write_u32(c, fd, ce->ce_flags & ONDISK_MASK);
+	return 0;
+}
+
 static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
-			  struct strbuf *previous_name, struct ondisk_cache_entry *ondisk)
+			  const struct cache_entry *pce,
+			  struct strbuf *previous_name,
+			  struct ondisk_cache_entry *ondisk)
 {
 	int size;
 	int result;
@@ -2173,8 +2223,12 @@ static int ce_write_entry(git_hash_ctx *c, int fd, struct cache_entry *ce,
 		to_remove = previous_name->len - common;
 		prefix_size = encode_varint(to_remove, to_remove_vi);
 
-		copy_cache_entry_to_ondisk(ondisk, ce);
-		result = ce_write(c, fd, ondisk, size);
+		if (!pce) {
+			copy_cache_entry_to_ondisk(ondisk, ce);
+			result = ce_write(c, fd, ondisk, size);
+		} else {
+			result = ce_write_entry_v5(c, fd, ce, pce);
+		}
 		if (!result)
 			result = ce_write(c, fd, to_remove_vi, prefix_size);
 		if (!result)
@@ -2313,7 +2367,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 	if (ce_write(&c, newfd, &hdr, sizeof(hdr)) < 0)
 		return -1;
 
-	previous_name = (hdr_version == 4) ? &previous_name_buf : NULL;
+	previous_name = (hdr_version >= 4) ? &previous_name_buf : NULL;
 
 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
@@ -2334,7 +2388,10 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 
 			drop_cache_tree = 1;
 		}
-		if (ce_write_entry(&c, newfd, ce, previous_name, (struct ondisk_cache_entry *)&ondisk) < 0)
+		if (ce_write_entry(&c, newfd, ce,
+				   hdr_version >= 5 && i > 0 ? cache[i - 1] : NULL,
+				   previous_name,
+				   (struct ondisk_cache_entry *)&ondisk) < 0)
 			err = -1;
 
 		if (err)
-- 8< --
--
Duy

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-08-03 15:58       ` Duy Nguyen
@ 2018-08-06 14:25         ` Ben Peart
  2018-08-15 21:05           ` Ben Peart
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Peart @ 2018-08-06 14:25 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ben Peart, Git Mailing List, Junio C Hamano, Eric Sunshine



On 8/3/2018 11:58 AM, Duy Nguyen wrote:
> On Thu, Aug 02, 2018 at 02:02:00PM -0400, Ben Peart wrote:
>>
>>

> But if you still want to push it further, this is something I have in
> mind. It probably has bugs, but at least preliminary test shows me
> that it could skip 99% work inside unpack_trees() and not need to
> write the index.
> 
> The main check to determine "checkout -b" is basically the new
> oidcmp() in merge_working_tree(). Not sure if I miss any condition in
> that check, I didn't look very closely at checkout code.
> 

Thanks Duy.  I think this is an interesting idea to pursue but... when I 
tried running this patch on a virtualized repo it started triggering 
many object downloads.  After taking a quick look, it appears that 
CE_UPDATE is set on every cache entry so check_updates() ends up calling 
checkout_entry() which writes out every file to the working tree - even 
those supposedly skipped by the skip-wortree bit.  Oops.

Not too surprising (you did say it probably has bugs :)) but it means I 
can't trivially get performance data on how much this will help.  It 
also fails a lot of tests (see below).

It experience does highlight the additional risk of this model of 
changing the underlying functions (vs the high level optimization of my 
original patch).  In addition, the new special cases in those 
lower-level functions do add additional complexity and fragility to the 
codebase.  So, like most things, to me it isn't a clear better/worse 
decision - it's just different.  While I like the idea of general 
optimizations that could apply more broadly to other commands; I do 
worry about the additional complexity, amount of code churn, and 
associated risk with the change.

When I have cycles, I'll take a look at how to fix this bug and get some 
performance data.  I just wanted to give you a heads up that I'm not 
ignoring your patch, just that it is going to take additional time and 
effort before I can properly evaluate how much impact it will have.


Test Summary Report
-------------------
./t1011-read-tree-sparse-checkout.sh               (Wstat: 256 Tests: 21 
Failed: 1)
   Failed test:  20
   Non-zero exit status: 1
./t1400-update-ref.sh                              (Wstat: 256 Tests: 
170 Failed: 73)
   Failed tests:  40, 42-45, 55-59, 70, 72, 82, 85, 87-88
                 90-100, 103-110, 113-119, 127, 129-130
                 132-133, 136-137, 140-147, 150-157, 160-166
                 170
   Non-zero exit status: 1
./t2011-checkout-invalid-head.sh                   (Wstat: 256 Tests: 10 
Failed: 5)
   Failed tests:  3, 6-7, 9-10
   Non-zero exit status: 1
./t2015-checkout-unborn.sh                         (Wstat: 256 Tests: 6 
Failed: 3)
   Failed tests:  2-4
   Non-zero exit status: 1
./t2017-checkout-orphan.sh                         (Wstat: 256 Tests: 13 
Failed: 7)
   Failed tests:  7-13
   Non-zero exit status: 1
./t3033-merge-toplevel.sh                          (Wstat: 256 Tests: 13 
Failed: 11)
   Failed tests:  3-13
   Non-zero exit status: 1
./t3200-branch.sh                                  (Wstat: 256 Tests: 
139 Failed: 2)
   Failed tests:  137-138
   Non-zero exit status: 1
./t5616-partial-clone.sh                           (Wstat: 256 Tests: 13 
Failed: 1)
   Failed test:  4
   Non-zero exit status: 1
./t5516-fetch-push.sh                              (Wstat: 256 Tests: 90 
Failed: 1)
   Failed test:  34
   Non-zero exit status: 1
./t6300-for-each-ref.sh                            (Wstat: 256 Tests: 
205 Failed: 9)
   Failed tests:  189-196, 199
   Non-zero exit status: 1
./t7114-reset-sparse-checkout.sh                   (Wstat: 256 Tests: 3 
Failed: 2)
   Failed tests:  2-3
   Non-zero exit status: 1
./t7063-status-untracked-cache.sh                  (Wstat: 256 Tests: 50 
Failed: 1)
   Failed test:  23
   Non-zero exit status: 1
./t7201-co.sh                                      (Wstat: 256 Tests: 38 
Failed: 33)
   Failed tests:  4, 6-27, 29-38
   Non-zero exit status: 1
./t7409-submodule-detached-work-tree.sh            (Wstat: 256 Tests: 2 
Failed: 1)
   Failed test:  1
   Non-zero exit status: 1
./t9350-fast-export.sh                             (Wstat: 256 Tests: 37 
Failed: 1)
   Failed test:  12
   Non-zero exit status: 1
./t9903-bash-prompt.sh                             (Wstat: 256 Tests: 65 
Failed: 52)
   Failed tests:  4, 6-10, 14-34, 36, 39-51, 53-62, 65
   Non-zero exit status: 1
Files=834, Tests=19658, 2081 wallclock secs (10.42 usr 15.09 sys + 
1082.56 cusr 3530.46 csys = 4638.53 CPU)
Result: FAIL


> -- 8< --
> diff --git a/builtin/checkout.c b/builtin/checkout.c
> index 28627650cd..912e565acc 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -478,6 +478,7 @@ static int merge_working_tree(const struct checkout_opts *opts,
>   {
>   	int ret;
>   	struct lock_file lock_file = LOCK_INIT;
> +	int skip_cache_tree_update = 0;
>   
>   	hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
>   	if (read_cache_preload(NULL) < 0)
> @@ -493,6 +494,7 @@ static int merge_working_tree(const struct checkout_opts *opts,
>   		struct tree_desc trees[2];
>   		struct tree *tree;
>   		struct unpack_trees_options topts;
> +		int nr_trees;
>   
>   		memset(&topts, 0, sizeof(topts));
>   		topts.head_idx = -1;
> @@ -514,20 +516,45 @@ static int merge_working_tree(const struct checkout_opts *opts,
>   		topts.merge = 1;
>   		topts.gently = opts->merge && old_branch_info->commit;
>   		topts.verbose_update = opts->show_progress;
> -		topts.fn = twoway_merge;
> +		/*
> +		 * We repair cache-tree explicitly in this function,
> +		 * no point in doing it again inside unpack_trees()
> +		 */
> +		topts.skip_cache_tree_update = 1;
>   		if (opts->overwrite_ignore) {
>   			topts.dir = xcalloc(1, sizeof(*topts.dir));
>   			topts.dir->flags |= DIR_SHOW_IGNORED;
>   			setup_standard_excludes(topts.dir);
>   		}
> -		tree = parse_tree_indirect(old_branch_info->commit ?
> -					   &old_branch_info->commit->object.oid :
> -					   the_hash_algo->empty_tree);
> -		init_tree_desc(&trees[0], tree->buffer, tree->size);
> -		tree = parse_tree_indirect(&new_branch_info->commit->object.oid);
> -		init_tree_desc(&trees[1], tree->buffer, tree->size);
> -
> -		ret = unpack_trees(2, trees, &topts);
> +		if (!oidcmp(&old_branch_info->commit->object.oid,
> +			    &new_branch_info->commit->object.oid)) {
> +			/*
> +			 * Skip 2-way merge because we know the end
> +			 * result would be the same anyway (to double
> +			 * check).
> +			 */
> +			topts.fn = noop_merge;
> +			nr_trees = 0;
> +			/*
> +			 * Repairing cache-tree is an optional optimization
> +			 * and could be safely skipped. Maybe have some
> +			 * heuristics based on the_index->cache_nr to
> +			 * determine if updating cache-tree could be too
> +			 * costly (including index writing cost) and skip it
> +			 * instead.
> +			 */
> +			skip_cache_tree_update = 1;
> +		} else {
> +			topts.fn = twoway_merge;
> +			tree = parse_tree_indirect(old_branch_info->commit ?
> +						   &old_branch_info->commit->object.oid :
> +						   the_hash_algo->empty_tree);
> +			init_tree_desc(&trees[0], tree->buffer, tree->size);
> +			tree = parse_tree_indirect(&new_branch_info->commit->object.oid);
> +			init_tree_desc(&trees[1], tree->buffer, tree->size);
> +			nr_trees = 2;
> +		}
> +		ret = unpack_trees(nr_trees, trees, &topts);
>   		clear_unpack_trees_porcelain(&topts);
>   		if (ret == -1) {
>   			/*
> @@ -597,13 +624,15 @@ static int merge_working_tree(const struct checkout_opts *opts,
>   		}
>   	}
>   
> -	if (!active_cache_tree)
> -		active_cache_tree = cache_tree();
> +	if (!skip_cache_tree_update) {
> +		if (!active_cache_tree)
> +			active_cache_tree = cache_tree();
>   
> -	if (!cache_tree_fully_valid(active_cache_tree))
> -		cache_tree_update(&the_index, WRITE_TREE_SILENT | WRITE_TREE_REPAIR);
> +		if (!cache_tree_fully_valid(active_cache_tree))
> +			cache_tree_update(&the_index, WRITE_TREE_SILENT | WRITE_TREE_REPAIR);
> +	}
>   
> -	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK))
> +	if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK | SKIP_IF_UNCHANGED))
>   		die(_("unable to write new index file"));
>   
>   	if (!opts->force && !opts->quiet)
> diff --git a/unpack-trees.c b/unpack-trees.c
> index cd0680f11e..007d6662f2 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1275,6 +1275,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>   	int i, ret;
>   	static struct cache_entry *dfc;
>   	struct exclude_list el;
> +	int result_is_actually_src = 0;
>   
>   	if (len > MAX_UNPACK_TREES)
>   		die("unpack_trees takes at most %d trees", MAX_UNPACK_TREES);
> @@ -1353,6 +1354,10 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>   
>   		if (traverse_trees(len, t, &info) < 0)
>   			goto return_failed;
> +	} else if (o->fn == noop_merge) {
> +		o->result = *o->src_index;
> +		o->cache_bottom = o->src_index->cache_nr;
> +		result_is_actually_src = 1;
>   	}
>   
>   	/* Any left-over entries in the index? */
> @@ -1426,7 +1431,7 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>   
>   	ret = check_updates(o) ? (-2) : 0;
>   	if (o->dst_index) {
> -		if (!ret) {
> +		if (!ret &&!o->skip_cache_tree_update) {
>   			if (!o->result.cache_tree)
>   				o->result.cache_tree = cache_tree();
>   			if (!cache_tree_fully_valid(o->result.cache_tree))
> @@ -1435,7 +1440,10 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>   						  WRITE_TREE_REPAIR);
>   		}
>   		move_index_extensions(&o->result, o->src_index);
> -		discard_index(o->dst_index);
> +		if (result_is_actually_src && o->dst_index == o->src_index)
> +			; /* no discarding */
> +		else
> +			discard_index(o->dst_index);
>   		*o->dst_index = o->result;
>   	} else {
>   		discard_index(&o->result);
> @@ -2193,3 +2201,9 @@ int oneway_merge(const struct cache_entry * const *src,
>   	}
>   	return merged_entry(a, old, o);
>   }
> +
> +int noop_merge(const struct cache_entry * const *src,
> +	       struct unpack_trees_options *o)
> +{
> +	BUG("This is basically a flag and not supposed to be called");
> +}
> diff --git a/unpack-trees.h b/unpack-trees.h
> index c2b434c606..3e5bce19a1 100644
> --- a/unpack-trees.h
> +++ b/unpack-trees.h
> @@ -52,6 +52,7 @@ struct unpack_trees_options {
>   		     diff_index_cached,
>   		     debug_unpack,
>   		     skip_sparse_checkout,
> +		     skip_cache_tree_update,
>   		     gently,
>   		     exiting_early,
>   		     show_all_errors,
> @@ -96,5 +97,7 @@ int bind_merge(const struct cache_entry * const *src,
>   	       struct unpack_trees_options *o);
>   int oneway_merge(const struct cache_entry * const *src,
>   		 struct unpack_trees_options *o);
> +int noop_merge(const struct cache_entry * const *src,
> +	       struct unpack_trees_options *o);
>   
>   #endif
> -- 8< --
> 
> --
> Duy
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] checkout: optimize "git checkout -b <new_branch>"
  2018-08-06 14:25         ` Ben Peart
@ 2018-08-15 21:05           ` Ben Peart
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Peart @ 2018-08-15 21:05 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ben Peart, Git Mailing List, Junio C Hamano, Eric Sunshine



On 8/6/2018 10:25 AM, Ben Peart wrote:
> 
> 
> On 8/3/2018 11:58 AM, Duy Nguyen wrote:
>> On Thu, Aug 02, 2018 at 02:02:00PM -0400, Ben Peart wrote:
>>>
>>>
> 
>> But if you still want to push it further, this is something I have in
>> mind. It probably has bugs, but at least preliminary test shows me
>> that it could skip 99% work inside unpack_trees() and not need to
>> write the index.
>>
>> The main check to determine "checkout -b" is basically the new
>> oidcmp() in merge_working_tree(). Not sure if I miss any condition in
>> that check, I didn't look very closely at checkout code.
>>
> 
> Thanks Duy.  I think this is an interesting idea to pursue but... when I 
> tried running this patch on a virtualized repo it started triggering 
> many object downloads.  After taking a quick look, it appears that 
> CE_UPDATE is set on every cache entry so check_updates() ends up calling 
> checkout_entry() which writes out every file to the working tree - even 
> those supposedly skipped by the skip-wortree bit.  Oops.
> 
> Not too surprising (you did say it probably has bugs :)) but it means I 
> can't trivially get performance data on how much this will help.  It 
> also fails a lot of tests (see below).
> 
> It experience does highlight the additional risk of this model of 
> changing the underlying functions (vs the high level optimization of my 
> original patch).  In addition, the new special cases in those 
> lower-level functions do add additional complexity and fragility to the 
> codebase.  So, like most things, to me it isn't a clear better/worse 
> decision - it's just different.  While I like the idea of general 
> optimizations that could apply more broadly to other commands; I do 
> worry about the additional complexity, amount of code churn, and 
> associated risk with the change.
> 
> When I have cycles, I'll take a look at how to fix this bug and get some 
> performance data.  I just wanted to give you a heads up that I'm not 
> ignoring your patch, just that it is going to take additional time and 
> effort before I can properly evaluate how much impact it will have.
> 

Now that the unpack-trees and cache-tree optimizations are settling 
down, I took a look at this proposed patch again with the intent of 
debugging why so many tests were broken by it.

The most obvious first fix was for all the segment faults when 
dereferencing a NULL pointer.  Adding an additional test so that we only 
perform the optimization when we actually have commit ID's to compare 
fixed a bunch of the test failures.

The next fix was to resolve all the breaks caused by applying this 
optimization when sparse-checkout is turned on.  Since we are skipping 
the logic to update the skip-worktree bit, I added an additional test so 
that we only perform the optimization when sparse checkout is not turned 
on.  Of course, this does completely remove the optimization when using 
sparse checkout so it isn't a workable permanent solution but it let me 
make progress.

There are still test failures with submodules and partial clone.  I 
haven't found/added the necessary tests to prevent those breaks nor the 
few other remaining breaks.

My current set of tests looks like this:

	if (!core_apply_sparse_checkout &&
		old_branch_info->commit &&
		new_branch_info->commit &&
		!oidcmp(&old_branch_info->commit->object.oid,
			&new_branch_info->commit->object.oid)) {

While I'm sure I could find and add additional tests to handle the 
remaining bugs, the net result is starting to look as fragile as the 
original patch.

Unfortunately it has the additional downsides of 1) being at a much 
lower level where we risk breaking more code paths and 2) not being 
nearly as much savings (with the original patch checkout -b <new branch> 
takes 0.3 seconds, this patch will make it take >4 seconds.)

Net, net - I don't think this particular path is a better path to pursue.

I understand the concern with the fragility of the current patch and 
it's set of tests to determine if the optimization is valid.  I also 
understand the concern with the potential change in behavior (ie not 
showing the local changes - even though nothing has changed).  Other 
than switching the optimization back to be "opt-in" via a config flag, I 
don't currently have a great answer.  I'll keep thinking and looking but 
am open to suggestions!



> Test Summary Report
> -------------------
> ./t1011-read-tree-sparse-checkout.sh               (Wstat: 256 Tests: 21 
> Failed: 1)
>    Failed test:  20
>    Non-zero exit status: 1
> ./t1400-update-ref.sh                              (Wstat: 256 Tests: 
> 170 Failed: 73)
>    Failed tests:  40, 42-45, 55-59, 70, 72, 82, 85, 87-88
>                  90-100, 103-110, 113-119, 127, 129-130
>                  132-133, 136-137, 140-147, 150-157, 160-166
>                  170
>    Non-zero exit status: 1
> ./t2011-checkout-invalid-head.sh                   (Wstat: 256 Tests: 10 
> Failed: 5)
>    Failed tests:  3, 6-7, 9-10
>    Non-zero exit status: 1
> ./t2015-checkout-unborn.sh                         (Wstat: 256 Tests: 6 
> Failed: 3)
>    Failed tests:  2-4
>    Non-zero exit status: 1
> ./t2017-checkout-orphan.sh                         (Wstat: 256 Tests: 13 
> Failed: 7)
>    Failed tests:  7-13
>    Non-zero exit status: 1
> ./t3033-merge-toplevel.sh                          (Wstat: 256 Tests: 13 
> Failed: 11)
>    Failed tests:  3-13
>    Non-zero exit status: 1
> ./t3200-branch.sh                                  (Wstat: 256 Tests: 
> 139 Failed: 2)
>    Failed tests:  137-138
>    Non-zero exit status: 1
> ./t5616-partial-clone.sh                           (Wstat: 256 Tests: 13 
> Failed: 1)
>    Failed test:  4
>    Non-zero exit status: 1
> ./t5516-fetch-push.sh                              (Wstat: 256 Tests: 90 
> Failed: 1)
>    Failed test:  34
>    Non-zero exit status: 1
> ./t6300-for-each-ref.sh                            (Wstat: 256 Tests: 
> 205 Failed: 9)
>    Failed tests:  189-196, 199
>    Non-zero exit status: 1
> ./t7114-reset-sparse-checkout.sh                   (Wstat: 256 Tests: 3 
> Failed: 2)
>    Failed tests:  2-3
>    Non-zero exit status: 1
> ./t7063-status-untracked-cache.sh                  (Wstat: 256 Tests: 50 
> Failed: 1)
>    Failed test:  23
>    Non-zero exit status: 1
> ./t7201-co.sh                                      (Wstat: 256 Tests: 38 
> Failed: 33)
>    Failed tests:  4, 6-27, 29-38
>    Non-zero exit status: 1
> ./t7409-submodule-detached-work-tree.sh            (Wstat: 256 Tests: 2 
> Failed: 1)
>    Failed test:  1
>    Non-zero exit status: 1
> ./t9350-fast-export.sh                             (Wstat: 256 Tests: 37 
> Failed: 1)
>    Failed test:  12
>    Non-zero exit status: 1
> ./t9903-bash-prompt.sh                             (Wstat: 256 Tests: 65 
> Failed: 52)
>    Failed tests:  4, 6-10, 14-34, 36, 39-51, 53-62, 65
>    Non-zero exit status: 1
> Files=834, Tests=19658, 2081 wallclock secs (10.42 usr 15.09 sys + 
> 1082.56 cusr 3530.46 csys = 4638.53 CPU)
> Result: FAIL
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-07-24 18:01 [PATCH v1] checkout: optionally speed up "git checkout -b foo" Ben Peart
                   ` (2 preceding siblings ...)
  2018-07-31 16:39 ` [PATCH v2] checkout: optimize "git checkout -b <new_branch>" Ben Peart
@ 2018-08-16 18:27 ` Ben Peart
  2018-08-16 18:37   ` Duy Nguyen
  2018-09-18  5:34   ` [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 32+ messages in thread
From: Ben Peart @ 2018-08-16 18:27 UTC (permalink / raw)
  To: Ben Peart
  Cc: git@vger.kernel.org, gitster@pobox.com, pclouds@gmail.com,
	sunshine@sunshineco.com, Ben Peart

From: Ben Peart <Ben.Peart@microsoft.com>

Skip merging the commit, updating the index and working directory if and
only if we are creating a new branch via "git checkout -b <new_branch>."
Any other checkout options will still go through the former code path.

If sparse_checkout is on, require the user to manually opt in to this
optimzed behavior by setting the config setting checkout.optimizeNewBranch
to true as we will no longer update the skip-worktree bit in the index, nor
add/remove files in the working directory to reflect the current sparse
checkout settings.

For comparison, running "git checkout -b <new_branch>" on a large repo takes:

14.6 seconds - without this patch
0.3 seconds - with this patch

Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
---

The biggest change in this version is that I have added the logic to call
show_local_changes() on the optimized path.  Since this can be expensive on
large repos (see below) this behavior is turned off if the user sets the
"checkout.optimizeNewBranch" setting.

$ git checkout -b newbranch
read-cache.c:2006       performance: 0.517875500 s: read cache .git/index
name-hash.c:605         performance: 0.277765000 s: initialize name hash
preload-index.c:111     performance: 0.019401300 s: preload index
diff-lib.c:527          performance: 3.807563700 s: diff-index
Switched to a new branch 'newbranch'
trace.c:420             performance: 5.044219600 s: git command: c:git checkout -b newbranch


$ git checkout -b newbranch1
Switched to a new branch 'newbranch1'
trace.c:420             performance: 0.332873600 s: git command: c:git checkout -b newbranch111

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/aebe02c966
    Checkout: git fetch https://github.com/benpeart/git checkout-b-v3 && git checkout aebe02c966
    
    ### Patches

 Documentation/config.txt         |   8 +++
 builtin/checkout.c               | 120 +++++++++++++++++++++++++++++--
 t/t1090-sparse-checkout-scope.sh |  14 ++++
 3 files changed, 138 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index fd8d27e761..2298ecd753 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1135,6 +1135,14 @@ and by linkgit:git-worktree[1] when 'git worktree add' refers to a
 remote branch. This setting might be used for other checkout-like
 commands or functionality in the future.
 
+checkout.optimizeNewBranch
+	Optimizes the performance of "git checkout -b <new_branch>" when
+	using sparse-checkout.  When set to true, git will not update the
+	repo based on the current sparse-checkout settings.  This means it
+	will not update the skip-worktree bit in the index nor add/remove
+	files in the working directory to reflect the current sparse checkout
+	settings nor will it show the local changes.
+
 clean.requireForce::
 	A boolean to make git-clean do nothing unless given -f,
 	-i or -n.   Defaults to true.
diff --git a/builtin/checkout.c b/builtin/checkout.c
index cb6bb76312..26b1a5053a 100644
--- a/builtin/checkout.c
+++ b/builtin/checkout.c
@@ -25,6 +25,8 @@
 #include "submodule.h"
 #include "advice.h"
 
+static int checkout_optimize_new_branch;
+
 static const char * const checkout_usage[] = {
 	N_("git checkout [<options>] <branch>"),
 	N_("git checkout [<options>] [<branch>] -- <file>..."),
@@ -42,6 +44,10 @@ struct checkout_opts {
 	int ignore_skipworktree;
 	int ignore_other_worktrees;
 	int show_progress;
+	/*
+	 * If new checkout options are added, skip_merge_working_tree
+	 * should be updated accordingly.
+	 */
 
 	const char *new_branch;
 	const char *new_branch_force;
@@ -472,6 +478,98 @@ static void setup_branch_path(struct branch_info *branch)
 	branch->path = strbuf_detach(&buf, NULL);
 }
 
+/*
+ * Skip merging the trees, updating the index and working directory if and
+ * only if we are creating a new branch via "git checkout -b <new_branch>."
+ */
+static int skip_merge_working_tree(const struct checkout_opts *opts,
+	const struct branch_info *old_branch_info,
+	const struct branch_info *new_branch_info)
+{
+	/*
+	 * Do the merge if sparse checkout is on and the user has not opted in
+	 * to the optimized behavior
+	 */
+	if (core_apply_sparse_checkout && !checkout_optimize_new_branch)
+		return 0;
+
+	/*
+	 * We must do the merge if we are actually moving to a new commit.
+	 */
+	if (!old_branch_info->commit || !new_branch_info->commit ||
+		oidcmp(&old_branch_info->commit->object.oid, &new_branch_info->commit->object.oid))
+		return 0;
+
+	/*
+	 * opts->patch_mode cannot be used with switching branches so is
+	 * not tested here
+	 */
+
+	/*
+	 * opts->quiet only impacts output so doesn't require a merge
+	 */
+
+	/*
+	 * Honor the explicit request for a three-way merge or to throw away
+	 * local changes
+	 */
+	if (opts->merge || opts->force)
+		return 0;
+
+	/*
+	 * --detach is documented as "updating the index and the files in the
+	 * working tree" but this optimization skips those steps so fall through
+	 * to the regular code path.
+	 */
+	if (opts->force_detach)
+		return 0;
+
+	/*
+	 * opts->writeout_stage cannot be used with switching branches so is
+	 * not tested here
+	 */
+
+	/*
+	 * Honor the explicit ignore requests
+	 */
+	if (!opts->overwrite_ignore || opts->ignore_skipworktree ||
+		opts->ignore_other_worktrees)
+		return 0;
+
+	/*
+	 * opts->show_progress only impacts output so doesn't require a merge
+	 */
+
+	/*
+	 * If we aren't creating a new branch any changes or updates will
+	 * happen in the existing branch.  Since that could only be updating
+	 * the index and working directory, we don't want to skip those steps
+	 * or we've defeated any purpose in running the command.
+	 */
+	if (!opts->new_branch)
+		return 0;
+
+	/*
+	 * new_branch_force is defined to "create/reset and checkout a branch"
+	 * so needs to go through the merge to do the reset
+	 */
+	if (opts->new_branch_force)
+		return 0;
+
+	/*
+	 * A new orphaned branch requrires the index and the working tree to be
+	 * adjusted to <start_point>
+	 */
+	if (opts->new_orphan_branch)
+		return 0;
+
+	/*
+	 * Remaining variables are not checkout options but used to track state
+	 */
+
+	return 1;
+}
+
 static int merge_working_tree(const struct checkout_opts *opts,
 			      struct branch_info *old_branch_info,
 			      struct branch_info *new_branch_info,
@@ -846,10 +944,19 @@ static int switch_branches(const struct checkout_opts *opts,
 		parse_commit_or_die(new_branch_info->commit);
 	}
 
-	ret = merge_working_tree(opts, &old_branch_info, new_branch_info, &writeout_error);
-	if (ret) {
-		free(path_to_free);
-		return ret;
+	/* optimize the "checkout -b <new_branch> path */
+	if (skip_merge_working_tree(opts, &old_branch_info, new_branch_info)) {
+		if (!checkout_optimize_new_branch && !opts->quiet) {
+			if (read_cache_preload(NULL) < 0)
+				return error(_("index file corrupt"));
+			show_local_changes(&new_branch_info->commit->object, &opts->diff_options);
+		}
+	} else {
+		ret = merge_working_tree(opts, &old_branch_info, new_branch_info, &writeout_error);
+		if (ret) {
+			free(path_to_free);
+			return ret;
+		}
 	}
 
 	if (!opts->quiet && !old_branch_info.path && old_branch_info.commit && new_branch_info->commit != old_branch_info.commit)
@@ -864,6 +971,11 @@ static int switch_branches(const struct checkout_opts *opts,
 
 static int git_checkout_config(const char *var, const char *value, void *cb)
 {
+	if (!strcmp(var, "checkout.optimizenewbranch")) {
+		checkout_optimize_new_branch = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "diff.ignoresubmodules")) {
 		struct checkout_opts *opts = cb;
 		handle_ignore_submodules_arg(&opts->diff_options, value);
diff --git a/t/t1090-sparse-checkout-scope.sh b/t/t1090-sparse-checkout-scope.sh
index 1f61eb3e88..25d7c700f6 100755
--- a/t/t1090-sparse-checkout-scope.sh
+++ b/t/t1090-sparse-checkout-scope.sh
@@ -31,6 +31,20 @@ test_expect_success 'perform sparse checkout of master' '
 	test_path_is_file c
 '
 
+test_expect_success 'checkout -b checkout.optimizeNewBranch interaction' '
+	cp .git/info/sparse-checkout .git/info/sparse-checkout.bak &&
+	test_when_finished "
+		mv -f .git/info/sparse-checkout.bak .git/info/sparse-checkout
+		git checkout master
+	" &&
+	echo "/b" >>.git/info/sparse-checkout &&
+	test "$(git ls-files -t b)" = "S b" &&
+	git -c checkout.optimizeNewBranch=true checkout -b fast &&
+	test "$(git ls-files -t b)" = "S b" &&
+	git checkout -b slow &&
+	test "$(git ls-files -t b)" = "H b"
+'
+
 test_expect_success 'merge feature branch into sparse checkout of master' '
 	git merge feature &&
 	test_path_is_file a &&

base-commit: 63749b2dea5d1501ff85bab7b8a7f64911d21dea
-- 
2.18.0.windows.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-16 18:27 ` [PATCH v3] " Ben Peart
@ 2018-08-16 18:37   ` Duy Nguyen
  2018-08-17 12:37     ` Ben Peart
  2018-09-18  5:34   ` [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 32+ messages in thread
From: Duy Nguyen @ 2018-08-16 18:37 UTC (permalink / raw)
  To: Ben Peart; +Cc: Git Mailing List, Junio C Hamano, Eric Sunshine

On Thu, Aug 16, 2018 at 8:27 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>
> From: Ben Peart <Ben.Peart@microsoft.com>
>
> Skip merging the commit, updating the index and working directory if and
> only if we are creating a new branch via "git checkout -b <new_branch>."
> Any other checkout options will still go through the former code path.
>
> If sparse_checkout is on, require the user to manually opt in to this
> optimzed behavior by setting the config setting checkout.optimizeNewBranch
> to true as we will no longer update the skip-worktree bit in the index, nor
> add/remove files in the working directory to reflect the current sparse
> checkout settings.
>
> For comparison, running "git checkout -b <new_branch>" on a large repo takes:
>
> 14.6 seconds - without this patch
> 0.3 seconds - with this patch

I still don't think we should do this. If you want lightning fast
branch creation, just use 'git branch'. From the timing breakdown you
shown in the other thread it looks like sparse checkout still takes
seconds, which could be optimized (or even excluded, I mentioned this
too). And split index (or something similar if you can't use it) would
give you saving across the board. There is still one idea Elijah gave
me that should further lower traverse_trees()  cost.

But anyway, it's not my call. I'll stop here.
-- 
Duy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-16 18:37   ` Duy Nguyen
@ 2018-08-17 12:37     ` Ben Peart
  2018-08-19  1:44       ` Elijah Newren
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Peart @ 2018-08-17 12:37 UTC (permalink / raw)
  To: Duy Nguyen, Ben Peart; +Cc: Git Mailing List, Junio C Hamano, Eric Sunshine



On 8/16/2018 2:37 PM, Duy Nguyen wrote:
> On Thu, Aug 16, 2018 at 8:27 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>>
>> From: Ben Peart <Ben.Peart@microsoft.com>
>>
>> Skip merging the commit, updating the index and working directory if and
>> only if we are creating a new branch via "git checkout -b <new_branch>."
>> Any other checkout options will still go through the former code path.
>>
>> If sparse_checkout is on, require the user to manually opt in to this
>> optimzed behavior by setting the config setting checkout.optimizeNewBranch
>> to true as we will no longer update the skip-worktree bit in the index, nor
>> add/remove files in the working directory to reflect the current sparse
>> checkout settings.
>>
>> For comparison, running "git checkout -b <new_branch>" on a large repo takes:
>>
>> 14.6 seconds - without this patch
>> 0.3 seconds - with this patch
> 
> I still don't think we should do this. If you want lightning fast
> branch creation, just use 'git branch'. From the timing breakdown you
> shown in the other thread it looks like sparse checkout still takes
> seconds, which could be optimized (or even excluded, I mentioned this
> too). And split index (or something similar if you can't use it) would
> give you saving across the board. There is still one idea Elijah gave
> me that should further lower traverse_trees()  cost.
> 

We have investigated some of these already - split index ended up 
slowing things down more than it sped them up do to the higher compute 
costs.  Sparse checkout we've already optimized significantly - limiting 
the patterns we accept so that we can do the lookup via a hashmap 
instead of the robust pattern matching.  We will continue to look for 
other optimizations and appreciate any and all ideas!

In the end, this optimization makes a huge performance improvement by 
avoiding doing a lot of work that isn't necessary.  Taking a command 
from 14+ seconds to sub-second is just too much of a win for us to ignore.

> But anyway, it's not my call. I'll stop here.
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-17 12:37     ` Ben Peart
@ 2018-08-19  1:44       ` Elijah Newren
  2018-08-20 13:40         ` Ben Peart
  2018-08-20 18:31         ` Junio C Hamano
  0 siblings, 2 replies; 32+ messages in thread
From: Elijah Newren @ 2018-08-19  1:44 UTC (permalink / raw)
  To: Ben Peart
  Cc: Nguyễn Thái Ngọc, Ben Peart, Git Mailing List,
	Junio C Hamano, Eric Sunshine

On Fri, Aug 17, 2018 at 5:41 AM Ben Peart <peartben@gmail.com> wrote:
> On 8/16/2018 2:37 PM, Duy Nguyen wrote:
> > On Thu, Aug 16, 2018 at 8:27 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
> >>
> >> From: Ben Peart <Ben.Peart@microsoft.com>
> >>
> >> Skip merging the commit, updating the index and working directory if and
> >> only if we are creating a new branch via "git checkout -b <new_branch>."
> >> Any other checkout options will still go through the former code path.
> >>
> >> If sparse_checkout is on, require the user to manually opt in to this
> >> optimzed behavior by setting the config setting checkout.optimizeNewBranch
> >> to true as we will no longer update the skip-worktree bit in the index, nor
> >> add/remove files in the working directory to reflect the current sparse
> >> checkout settings.
> >>
> >> For comparison, running "git checkout -b <new_branch>" on a large repo takes:
> >>
> >> 14.6 seconds - without this patch
> >> 0.3 seconds - with this patch
> >
> > I still don't think we should do this. If you want lightning fast
> > branch creation, just use 'git branch'. From the timing breakdown you
> > shown in the other thread it looks like sparse checkout still takes
> > seconds, which could be optimized (or even excluded, I mentioned this
> > too). And split index (or something similar if you can't use it) would
> > give you saving across the board. There is still one idea Elijah gave
> > me that should further lower traverse_trees()  cost.
> >
>
> We have investigated some of these already - split index ended up
> slowing things down more than it sped them up do to the higher compute
> costs.  Sparse checkout we've already optimized significantly - limiting
> the patterns we accept so that we can do the lookup via a hashmap
> instead of the robust pattern matching.  We will continue to look for
> other optimizations and appreciate any and all ideas!
>
> In the end, this optimization makes a huge performance improvement by
> avoiding doing a lot of work that isn't necessary.  Taking a command
> from 14+ seconds to sub-second is just too much of a win for us to ignore.
>
> > But anyway, it's not my call. I'll stop here.

It's even less of my call, but since things seem to be stuck in
what-should-we-do state (as per Junio's comments on this patch in the
last two "What's cooking" emails), and since Ben and Duy obviously
have opposite opinions on Ben's patch, let's see if I might be able to
help at all.  Here's my summary and my findings:

== The pain ==
- For repositories with a really large number of entries (500K as Ben
says), some operations are much slower than it feels like they should
be.
- This does not seem to be GFVS-specific in any way, I can duplicate
slowness with a simple git-bomb[1]-like repo that has a sparse
checkout pattern ignoring the "bomb" side.  (It has 1M+1 entries in
the index, and .git/info/sparse-checkout ignores the 1M so the working
copy only has 1 entry).  The timings on my repo for "git checkout -b
$NEWBRANCH" are almost exactly double what Ben reports he gets on
their repo.

[1] https://kate.io/blog/making-your-own-exploding-git-repos/

== Short term solutions ==
- Alternative git commands exist today to do a fast checkout of a new
branch in a huge repo.  I also get sub-second timings in my
even-bigger repo with this:
   git branch $NEWBRANCH && git symbolic-ref HEAD $NEWBRANCH
But I do understand that wrapping this into a script or executable
(git-fast-new-branch?) and asking users to use it is a usability
problem and an uphill battle.  (Sidenote: this isn't quite the same
operation; it's missing a reflog update.  The -m option to
symbolic-ref doesn't seem to help; I guess the fact that HEAD's
sha1sum is not changing is viewed as not an update?  However, that
could also be scripted around.)
- Ben's patch successfully drops the time for "git checkout -b
$NEWBRANCH" from 26+ seconds (in my cooked-up testcase) to sub-second
(in fact, under .1 seconds for me).  That's a _huge_ win.

== unpack_trees optimization notes ==
- Ben's patch is extremely focused.  It only affects "git checkout -b
$NEWBRANCH".  If someone runs "git branch $NEWBRANCH && git checkout
$NEWBRANCH", they get the old 26+ second timing.  They also suddenly
get the really long timing if they add any other flags or checkout a
commit that differs in only a single entry in the entire tree.  It
would be nice if we did general optimization for all issues rather
than just special casing such narrow cases.
- However, optimizing unpack_trees is hard.  It's really easy to get
lost trying to look at the code.  Time has been spent trying to
optimizing it.  Ben really likes the speedup factors of 2-3 that Duy
has produced.  But he's pessimistic we'll find enough to bridge the
gap for this case.  And he's worried about breaking unrelated stuff
due to the complexity of unpack_trees.
- Duy is pretty sure we can optimize unpack_trees in at least one more
way.  I've tried looking through the code and think there are others,
but then again I'm known to get lost and confused in unpack_trees.

== The patch ==
- Ben's patch only affects the "checkout -b $NEWBRANCH" case.  He
checks for it by looking for any other flag that would be a different
case, and using the old codepath if he finds any.
- This means there is a "giant list of checks" for this optimization,
and an ongoing maintenance burden because if anyone ever adds any
extra options, this optimization might suddenly break things if that
giant list of checks isn't updated.  Ben even added a comment to the
code hoping to help alert others who come along and add extra options:
+ /*
+ * If new checkout options are added, skip_merge_working_tree
+ * should be updated accordingly.
+ */

== Other notes ==
- In my cooked-up testcase, I also noticed that things like git add or
git status were slow in a repo with lots of index entries.  There may
be several other areas in need of performance boosts too.


== My opinions ==
- The performance wins are big enough that I can see why Ben is pushing this.
- I totally see Duy's point that more general optimizations would be
really nice.
- I really dislike the "giant list of checks" and the ongoing
maintenance burden it implies.

Overall, I have to side with Duy and say I don't think we should take
the patch as-is.  Since that'll be frustrating for Ben to hear, I've
tried to come up with some additional alternatives:

== Alternatives (both short and long term) ==
- Use the 'git branch $NEWBRANCH && git symbolic-ref HEAD $NEWBRANCH'
trick.  It's essentially just as fast (well, you exec git twice so
it's slightly slower, but it's close).  However, there's a difficult
get-it-to-the-users hurdle.
- Rewrite this patch so it instead does a very small set of checks at
the beginning of cmd_checkout(); e.g. check if argc == 3 and argv[1]
== "-b" and if so then perform the minimum operations needed to create
and checkout the new branch (maybe even calling in to cmd_branch() and
cmd_symbolic_ref() and calling some reflog update function).  Preface
it with a comment that it's a performance hack that might eventually
be able to go away.
- Look into the performance optimization(s) Duy mentioned.
- Other performance optimizations in unpack_trees(); it seems we have
an awful lot of loops over all cache entries, and it sure seems to me
like some of them could somehow be focused on just the things that are
changing instead of again checking everything.  Or maybe I'm wrong and
they're all needed.
- Possibly crazy idea for massive global performance win: modify how
the cache works so that it can have tree entries instead of just file
entries; make use of that with sparse checkouts (and partial clones?)
so that the cache only stores a tree when that entire tree is
"unwanted".  Suddenly, git status, git add, git checkout, etc., etc.
are all much faster.  merge needs some special work to operate,
though.


I'm potentially interested in eventually looking into and/or helping
with optimization work in this area.  I might even get more time to
work on git at $DAYJOB since sparse-checkouts/partial-clone/gvfs/etc.
all are looking more and more interesting as we have large repos that
are growing quickly (particularly as we consolidate many repos down to
fewer).  But they've said that such work won't take priority for
several months down the road for them, it's not yet clear how much
they'd sponsor vs. just want to see what existed and how we can use
it, and I'd kind of like to clear a few other things off my plate
first too (such as the merge-recursive rewrite).

Anyway, that's just my $0.02.  I hope something I said helps.

Elijah

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-19  1:44       ` Elijah Newren
@ 2018-08-20 13:40         ` Ben Peart
  2018-08-20 18:16           ` Elijah Newren
  2018-08-20 18:31         ` Junio C Hamano
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Peart @ 2018-08-20 13:40 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Nguyễn Thái Ngọc, Ben Peart, Git Mailing List,
	Junio C Hamano, Eric Sunshine



On 8/18/2018 9:44 PM, Elijah Newren wrote:
> On Fri, Aug 17, 2018 at 5:41 AM Ben Peart <peartben@gmail.com> wrote:
>> On 8/16/2018 2:37 PM, Duy Nguyen wrote:
>>> On Thu, Aug 16, 2018 at 8:27 PM Ben Peart <Ben.Peart@microsoft.com> wrote:
>>>>
>>>> From: Ben Peart <Ben.Peart@microsoft.com>
>>>>
>>>> Skip merging the commit, updating the index and working directory if and
>>>> only if we are creating a new branch via "git checkout -b <new_branch>."
>>>> Any other checkout options will still go through the former code path.
>>>>
>>>> If sparse_checkout is on, require the user to manually opt in to this
>>>> optimzed behavior by setting the config setting checkout.optimizeNewBranch
>>>> to true as we will no longer update the skip-worktree bit in the index, nor
>>>> add/remove files in the working directory to reflect the current sparse
>>>> checkout settings.
>>>>
>>>> For comparison, running "git checkout -b <new_branch>" on a large repo takes:
>>>>
>>>> 14.6 seconds - without this patch
>>>> 0.3 seconds - with this patch
>>>
>>> I still don't think we should do this. If you want lightning fast
>>> branch creation, just use 'git branch'. From the timing breakdown you
>>> shown in the other thread it looks like sparse checkout still takes
>>> seconds, which could be optimized (or even excluded, I mentioned this
>>> too). And split index (or something similar if you can't use it) would
>>> give you saving across the board. There is still one idea Elijah gave
>>> me that should further lower traverse_trees()  cost.
>>>
>>
>> We have investigated some of these already - split index ended up
>> slowing things down more than it sped them up do to the higher compute
>> costs.  Sparse checkout we've already optimized significantly - limiting
>> the patterns we accept so that we can do the lookup via a hashmap
>> instead of the robust pattern matching.  We will continue to look for
>> other optimizations and appreciate any and all ideas!
>>
>> In the end, this optimization makes a huge performance improvement by
>> avoiding doing a lot of work that isn't necessary.  Taking a command
>> from 14+ seconds to sub-second is just too much of a win for us to ignore.
>>
>>> But anyway, it's not my call. I'll stop here.
> 
> It's even less of my call, but since things seem to be stuck in
> what-should-we-do state (as per Junio's comments on this patch in the
> last two "What's cooking" emails), and since Ben and Duy obviously
> have opposite opinions on Ben's patch, let's see if I might be able to
> help at all.  Here's my summary and my findings:
> 
> == The pain ==
> - For repositories with a really large number of entries (500K as Ben
> says), some operations are much slower than it feels like they should
> be.
> - This does not seem to be GFVS-specific in any way, I can duplicate
> slowness with a simple git-bomb[1]-like repo that has a sparse
> checkout pattern ignoring the "bomb" side.  (It has 1M+1 entries in
> the index, and .git/info/sparse-checkout ignores the 1M so the working
> copy only has 1 entry).  The timings on my repo for "git checkout -b
> $NEWBRANCH" are almost exactly double what Ben reports he gets on
> their repo.
> 
> [1] https://kate.io/blog/making-your-own-exploding-git-repos/
> 
> == Short term solutions ==
> - Alternative git commands exist today to do a fast checkout of a new
> branch in a huge repo.  I also get sub-second timings in my
> even-bigger repo with this:
>     git branch $NEWBRANCH && git symbolic-ref HEAD $NEWBRANCH
> But I do understand that wrapping this into a script or executable
> (git-fast-new-branch?) and asking users to use it is a usability
> problem and an uphill battle.  (Sidenote: this isn't quite the same
> operation; it's missing a reflog update.  The -m option to
> symbolic-ref doesn't seem to help; I guess the fact that HEAD's
> sha1sum is not changing is viewed as not an update?  However, that
> could also be scripted around.)
> - Ben's patch successfully drops the time for "git checkout -b
> $NEWBRANCH" from 26+ seconds (in my cooked-up testcase) to sub-second
> (in fact, under .1 seconds for me).  That's a _huge_ win.
> 
> == unpack_trees optimization notes ==
> - Ben's patch is extremely focused.  It only affects "git checkout -b
> $NEWBRANCH".  If someone runs "git branch $NEWBRANCH && git checkout
> $NEWBRANCH", they get the old 26+ second timing.  They also suddenly
> get the really long timing if they add any other flags or checkout a
> commit that differs in only a single entry in the entire tree.  It
> would be nice if we did general optimization for all issues rather
> than just special casing such narrow cases.
> - However, optimizing unpack_trees is hard.  It's really easy to get
> lost trying to look at the code.  Time has been spent trying to
> optimizing it.  Ben really likes the speedup factors of 2-3 that Duy
> has produced.  But he's pessimistic we'll find enough to bridge the
> gap for this case.  And he's worried about breaking unrelated stuff
> due to the complexity of unpack_trees.
> - Duy is pretty sure we can optimize unpack_trees in at least one more
> way.  I've tried looking through the code and think there are others,
> but then again I'm known to get lost and confused in unpack_trees.
> 
> == The patch ==
> - Ben's patch only affects the "checkout -b $NEWBRANCH" case.  He
> checks for it by looking for any other flag that would be a different
> case, and using the old codepath if he finds any.
> - This means there is a "giant list of checks" for this optimization,
> and an ongoing maintenance burden because if anyone ever adds any
> extra options, this optimization might suddenly break things if that
> giant list of checks isn't updated.  Ben even added a comment to the
> code hoping to help alert others who come along and add extra options:
> + /*
> + * If new checkout options are added, skip_merge_working_tree
> + * should be updated accordingly.
> + */
> 
> == Other notes ==
> - In my cooked-up testcase, I also noticed that things like git add or
> git status were slow in a repo with lots of index entries.  There may
> be several other areas in need of performance boosts too.
> 
> 
> == My opinions ==
> - The performance wins are big enough that I can see why Ben is pushing this.
> - I totally see Duy's point that more general optimizations would be
> really nice.
> - I really dislike the "giant list of checks" and the ongoing
> maintenance burden it implies.
> 
> Overall, I have to side with Duy and say I don't think we should take
> the patch as-is.  Since that'll be frustrating for Ben to hear, I've
> tried to come up with some additional alternatives:

Thank you for the thorough review and summary.  From my perspective, it 
is both fair and accurate.

This is a performance improvement we've been using for nearly 2 years 
[1] because the win is so significant in our case, it makes living with 
the maintenance burden worth while.  That said, in the past 2 years, 
there haven't been any new options added to the checkout command that 
required any maintenance but the possibility clearly does exist.

My purpose in pushing this patch to the mailing list is to improve git 
in general and to let others benefit from the performance improvement 
we've been enjoying. If the community doesn't feel like the risk/reward 
is worth it, then it shouldn't be accepted and I'm perfectly fine with that.

[1] 
https://github.com/git-for-windows/git/commit/0a64b9ae6df02d4df3e41227c43c4217a1a90bfa

> 
> == Alternatives (both short and long term) ==
> - Use the 'git branch $NEWBRANCH && git symbolic-ref HEAD $NEWBRANCH'
> trick.  It's essentially just as fast (well, you exec git twice so
> it's slightly slower, but it's close).  However, there's a difficult
> get-it-to-the-users hurdle.
> - Rewrite this patch so it instead does a very small set of checks at
> the beginning of cmd_checkout(); e.g. check if argc == 3 and argv[1]
> == "-b" and if so then perform the minimum operations needed to create
> and checkout the new branch (maybe even calling in to cmd_branch() and
> cmd_symbolic_ref() and calling some reflog update function).  Preface
> it with a comment that it's a performance hack that might eventually
> be able to go away.

I'm happy to do this if it would make the patch better/more acceptable 
to the community.

One other change I could do would be to address Duy's concern about the 
config option would be to remove that entirely.  That means those using 
sparse-checkout don't have the option of getting the performance win but 
the benefit is that there isn't any documentation or behavior changes 
that need to be made - things just get a lot faster if you happen to use 
the (rather common) "checkout -b" option.

If making these two changes would resolve the remaining concerns, I'm 
happy to make them and send another patch series.  If it will just 
result in another round of "I'm still not comfortable with this" then 
I'd rather spend my time elsewhere. :)

> - Look into the performance optimization(s) Duy mentioned.
> - Other performance optimizations in unpack_trees(); it seems we have
> an awful lot of loops over all cache entries, and it sure seems to me
> like some of them could somehow be focused on just the things that are
> changing instead of again checking everything.  Or maybe I'm wrong and
> they're all needed.

I am absolutely in favor of any performance optimizations that can be 
made in unpack_trees().  I am not very familiar with that code and only 
recently started digging in when I started my recent RFC/patch series to 
multi-thread it [2].  I'm extremely grateful for all the ideas and work 
that thread has triggered - to me, it's an example of open source 
software development at it's finest.

I'd never even thought of using the cache-tree to simplify the traversal 
in that way and am grateful others have and especially to Duy who backed 
it up with a great patch series (I started the work but he was a lot 
faster at getting working patches than I was!).  In my opinion, this was 
a much better solution than the complexity of multi-threading this 
complex piece of code (though that remains an option if it is still needed).

[2] 
https://public-inbox.org/git/20180718204458.20936-1-benpeart@microsoft.com/

> - Possibly crazy idea for massive global performance win: modify how
> the cache works so that it can have tree entries instead of just file
> entries; make use of that with sparse checkouts (and partial clones?)
> so that the cache only stores a tree when that entire tree is
> "unwanted".  Suddenly, git status, git add, git checkout, etc., etc.
> are all much faster.  merge needs some special work to operate,
> though.
> 

Not crazy at all!  It's something we've investigated and discussed 
several times over the past couple of years.  This is the largest 
potential optimization that I am aware of as it has the potential to 
dramatically speed up most git operations as they become O(what I care 
about) vs O(size of repo/tree) - especially when paired with partial clone.

The downside is pretty obvious, pulling this off means rewriting an 
awful lot of the code!  The index is central to so much of git and the 
assumption is everywhere that the index is simply a (global) array of 
cache entries to be looped through and manipulated at will.

There would be a lot of refactoring work, creating a new index 
abstraction API, updating the code to utilize it, then updating the 
internal representation of the index to optimize it around the concept 
of being sparse/partial.  Likely, we'd want/need a new on disk format 
that would support the new structures in an optimal way.  Finally, the 
various algorithms (like unpack_trees()) could be rewritten to take 
advantage of the sparseness.

Not something that can be taken on lightly - but certainly something 
that has huge potential to enable git to scale to much larger repos.

> 
> I'm potentially interested in eventually looking into and/or helping
> with optimization work in this area.  I might even get more time to
> work on git at $DAYJOB since sparse-checkouts/partial-clone/gvfs/etc.
> all are looking more and more interesting as we have large repos that
> are growing quickly (particularly as we consolidate many repos down to
> fewer).  But they've said that such work won't take priority for
> several months down the road for them, it's not yet clear how much
> they'd sponsor vs. just want to see what existed and how we can use
> it, and I'd kind of like to clear a few other things off my plate
> first too (such as the merge-recursive rewrite).
> 
> Anyway, that's just my $0.02.  I hope something I said helps.
> 
> Elijah
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-20 13:40         ` Ben Peart
@ 2018-08-20 18:16           ` Elijah Newren
  2018-08-21 14:51             ` Duy Nguyen
  0 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren @ 2018-08-20 18:16 UTC (permalink / raw)
  To: Ben Peart
  Cc: Nguyễn Thái Ngọc, Ben Peart, Git Mailing List,
	Junio C Hamano, Eric Sunshine

On Mon, Aug 20, 2018 at 6:40 AM Ben Peart <peartben@gmail.com> wrote:
> On 8/18/2018 9:44 PM, Elijah Newren wrote:
...
> > == My opinions ==
> > - The performance wins are big enough that I can see why Ben is pushing this.
> > - I totally see Duy's point that more general optimizations would be
> > really nice.
> > - I really dislike the "giant list of checks" and the ongoing
> > maintenance burden it implies.
> >
> > Overall, I have to side with Duy and say I don't think we should take
> > the patch as-is.  Since that'll be frustrating for Ben to hear, I've
> > tried to come up with some additional alternatives:
>
> Thank you for the thorough review and summary.  From my perspective, it
> is both fair and accurate.
>
...
> > - Rewrite this patch so it instead does a very small set of checks at
> > the beginning of cmd_checkout(); e.g. check if argc == 3 and argv[1]
> > == "-b" and if so then perform the minimum operations needed to create
> > and checkout the new branch (maybe even calling in to cmd_branch() and
> > cmd_symbolic_ref() and calling some reflog update function).  Preface
> > it with a comment that it's a performance hack that might eventually
> > be able to go away.
>
> I'm happy to do this if it would make the patch better/more acceptable
> to the community.

For whatever it's worth, it'd change my vote.  While I would like
general optimizations more, I do think that's overall a longer term
prospect, and the wins here are big enough to justify a performance
hack -- if the amount of new code and added maintenance overhead is
minimized.  I think that checking for the args matching checkout -b
rather than checking all flags that might imply something other than
args being checkout -b would be a good way to help minimize both.

> One other change I could do would be to address Duy's concern about the
> config option would be to remove that entirely.  That means those using
> sparse-checkout don't have the option of getting the performance win but
> the benefit is that there isn't any documentation or behavior changes
> that need to be made - things just get a lot faster if you happen to use
> the (rather common) "checkout -b" option.

Playing with sparse-checkout, it feels to me like a half-baked
feature.  It seems like it required too much manual work, and it was
sometimes hard to tell if I was misunderstanding configuration rules
or I was just running into bugs in the code.  I think I hit both but I
didn't really want to get side-tracked further, yet.  (I do want to
eventually come back to it.)  The only reason someone would go through
that pain is if it provided massive performance benefits.

Viewing things when-in-doubt through the lens of strict backwards
compatibility may inadvertently mean enforcing buggy and/or slow
behavior with sparse-checkouts.  So, while git usually focuses
strongly on backwards compatibility (which is a good thing), that
might actually be wrong in the specific case of the sparse-checkout
feature.  In particular, since the whole point of the feature is
essentially a performance hack, if the initial implementation has ways
in which it really hurts performance, it seems like it'd be more in
keeping with the point of the feature to take performance fixes and
swallow subtle behavioral changes than to require special flags to get
decent performance.

That might sound like a long winded way of saying I'm totally in favor
of dropping the config option in this case, but I'm also trying to
build a case that I think there will be other situations where we want
to subtly and maybe even not-so-subtly change behavior of
sparse-checkouts to make them both more performant and far nicer for
end users.  I'm not sure others will agree with me on this argument,
but sparse checkouts felt painful to me and I think we need something
better.

> If making these two changes would resolve the remaining concerns, I'm
> happy to make them and send another patch series.  If it will just
> result in another round of "I'm still not comfortable with this" then
> I'd rather spend my time elsewhere. :)

That's fair.  The first change would be enough to resolve the concern
for me if others strongly push back on the second
(config-option-removal) change, but I'd definitely prefer both changes
-- fewer options means less code, less ongoing maintenance work, and
less manual setup needed by users.

> > - Possibly crazy idea for massive global performance win: modify how
> > the cache works so that it can have tree entries instead of just file
> > entries; make use of that with sparse checkouts (and partial clones?)
> > so that the cache only stores a tree when that entire tree is
> > "unwanted".  Suddenly, git status, git add, git checkout, etc., etc.
> > are all much faster.  merge needs some special work to operate,
> > though.
>
> Not crazy at all!  It's something we've investigated and discussed
> several times over the past couple of years.  This is the largest
> potential optimization that I am aware of as it has the potential to
> dramatically speed up most git operations as they become O(what I care
> about) vs O(size of repo/tree) - especially when paired with partial clone.
>
> The downside is pretty obvious, pulling this off means rewriting an
> awful lot of the code!  The index is central to so much of git and the
> assumption is everywhere that the index is simply a (global) array of
> cache entries to be looped through and manipulated at will.
>
> There would be a lot of refactoring work, creating a new index
> abstraction API, updating the code to utilize it, then updating the
> internal representation of the index to optimize it around the concept
> of being sparse/partial.  Likely, we'd want/need a new on disk format
> that would support the new structures in an optimal way.  Finally, the
> various algorithms (like unpack_trees()) could be rewritten to take
> advantage of the sparseness.
>
> Not something that can be taken on lightly - but certainly something
> that has huge potential to enable git to scale to much larger repos.

Well, the amount of work involved might qualify it as a "crazy" idea.
:-)  But, I'm glad others think it sounds like a useful path to purse.
Not only would it be a tremendous help performance-wise, it might also
be beneficial for another actually-crazy feature idea of mine.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-19  1:44       ` Elijah Newren
  2018-08-20 13:40         ` Ben Peart
@ 2018-08-20 18:31         ` Junio C Hamano
  1 sibling, 0 replies; 32+ messages in thread
From: Junio C Hamano @ 2018-08-20 18:31 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Ben Peart, Nguyễn Thái Ngọc, Ben Peart,
	Git Mailing List, Eric Sunshine

Elijah Newren <newren@gmail.com> writes:

> == The patch ==
> - Ben's patch only affects the "checkout -b $NEWBRANCH" case.  He
> checks for it by looking for any other flag that would be a different
> case, and using the old codepath if he finds any.
> - This means there is a "giant list of checks" for this optimization,
> and an ongoing maintenance burden because if anyone ever adds any
> extra options, this optimization might suddenly break things if that
> giant list of checks isn't updated.

Correct.  

Having said that, I do not mind leaving it in 'pu' for the rest of
the cycle, possibly merging it to 'next' after that and keeping it
there, to see how horrible a maintenance burden it would become in
real life, though.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-20 18:16           ` Elijah Newren
@ 2018-08-21 14:51             ` Duy Nguyen
  2018-08-30 17:22               ` Elijah Newren
  0 siblings, 1 reply; 32+ messages in thread
From: Duy Nguyen @ 2018-08-21 14:51 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Ben Peart, Ben Peart, Git Mailing List, Junio C Hamano,
	Eric Sunshine

On Mon, Aug 20, 2018 at 8:16 PM Elijah Newren <newren@gmail.com> wrote:
> Playing with sparse-checkout, it feels to me like a half-baked
> feature.  It seems like it required too much manual work, and it was
> sometimes hard to tell if I was misunderstanding configuration rules
> or I was just running into bugs in the code.  I think I hit both but I
> didn't really want to get side-tracked further, yet.  (I do want to
> eventually come back to it.)  The only reason someone would go through
> that pain is if it provided massive performance benefits.

In my defense it was one of my first contribution when I was naiver
and basically an evolution of "git update-index --assume-unchanged". I
have something in the queue to improve/complement sparse-checkout but
my last update on that branch was 2.5 years ago, so it's not coming
soon.

I'd love to year how sparse checkout could be improved, or even
replaced. I think we still have to have some configuration rules, and
yes the flexibility of sparse checkout (or gitignore to be precise)
rules is a double-edged sword.
-- 
Duy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-21 14:51             ` Duy Nguyen
@ 2018-08-30 17:22               ` Elijah Newren
  2018-09-04 16:46                 ` Duy Nguyen
  0 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren @ 2018-08-30 17:22 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc
  Cc: Ben Peart, Ben Peart, Git Mailing List, Junio C Hamano,
	Eric Sunshine

Hi Duy,

On Tue, Aug 21, 2018 at 7:52 AM Duy Nguyen <pclouds@gmail.com> wrote:
>
> On Mon, Aug 20, 2018 at 8:16 PM Elijah Newren <newren@gmail.com> wrote:
> > Playing with sparse-checkout, it feels to me like a half-baked
> > feature.  It seems like it required too much manual work, and it was
> > sometimes hard to tell if I was misunderstanding configuration rules
> > or I was just running into bugs in the code.  I think I hit both but I
> > didn't really want to get side-tracked further, yet.  (I do want to
> > eventually come back to it.)  The only reason someone would go through
> > that pain is if it provided massive performance benefits.
>
> In my defense it was one of my first contribution when I was naiver
> and basically an evolution of "git update-index --assume-unchanged". I
> have something in the queue to improve/complement sparse-checkout but
> my last update on that branch was 2.5 years ago, so it's not coming
> soon.
>
> I'd love to year how sparse checkout could be improved, or even
> replaced. I think we still have to have some configuration rules, and
> yes the flexibility of sparse checkout (or gitignore to be precise)
> rules is a double-edged sword.

Sorry for taking a while to respond, and if what I said came across
harshly.  I agree that the flexibility of the rules makes it more
complicated, though I think a bigger issue may be that the feature is
hard to make smooth unless coupled to something like partial clones.
Work on that is ongoing.  Anyway, in an attempt to be helpful, here
were some of the pain points I ran across:

- The fact that documentation could only be found in a low-level
plumbing command like read-tree made discoverability hard.  Why would
folks think to look there?  (I can't remember if I had to google it or
just grepped around the git source code to find it.)

- Needing to use read-tree, which isn't something most users are
familiar with, makes for a learning curve.  I may know what some of
the flags in read-tree do, but users will puzzle over which things on
the page happen to be relevant to them -- especially since the section
on sparse checkouts don't specify how read-tree should be invoked
after .git/info/sparse-checkout is populated.  Even I couldn't guess
what I was supposed to ran and just googled for hints.  Here's some
possible failures, as users guess which flags to pass:

$ git read-tree
warning: read-tree: emptying the index with no arguments is
deprecated; use --empty

$ git read-tree HEAD  # Oops, doesn't update the working directory

$ git read-tree -u HEAD  # Doesn't do any updates either

$ git read-tree -mu HEAD  # Works...but make user think "Why/what am I
merging?!?"

- I actually misunderstood or misread the documentation about undoing
sparse checkouts and failed multiple times.  I think I nuked the index
then did a 'git reset --hard HEAD'.  Re-reading, it looks like you did
explain it, and I don't remember why/how I missed it the first time
around, but I did.

- I either failed to grasp how to specify negative refs or botched
something else up.  I tried digging for a while to figure out how to
exclude my massive directory, but was always met with:
    error: Sparse checkout leaves no entry on working directory
I spent a while trying to figure out what I did wrong, but gave up
after half an hour or so since I wasn't trying to use the feature for
real and just specified the files I wanted to keep instead.
Re-reading the docs, it looks like you specified how to do this, and
re-trying now it works, but I repeatedly passed over the '/*' in the
docs and read it as either bad formatting or a highlight of the next
line rather than as important literal text.  So, maybe part of my
problem is that I just can't read.  :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"
  2018-08-30 17:22               ` Elijah Newren
@ 2018-09-04 16:46                 ` Duy Nguyen
  0 siblings, 0 replies; 32+ messages in thread
From: Duy Nguyen @ 2018-09-04 16:46 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Ben Peart, Ben Peart, Git Mailing List, Junio C Hamano,
	Eric Sunshine

On Thu, Aug 30, 2018 at 7:22 PM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Duy,
>
> On Tue, Aug 21, 2018 at 7:52 AM Duy Nguyen <pclouds@gmail.com> wrote:
> >
> > On Mon, Aug 20, 2018 at 8:16 PM Elijah Newren <newren@gmail.com> wrote:
> > > Playing with sparse-checkout, it feels to me like a half-baked
> > > feature.  It seems like it required too much manual work, and it was
> > > sometimes hard to tell if I was misunderstanding configuration rules
> > > or I was just running into bugs in the code.  I think I hit both but I
> > > didn't really want to get side-tracked further, yet.  (I do want to
> > > eventually come back to it.)  The only reason someone would go through
> > > that pain is if it provided massive performance benefits.
> >
> > In my defense it was one of my first contribution when I was naiver
> > and basically an evolution of "git update-index --assume-unchanged". I
> > have something in the queue to improve/complement sparse-checkout but
> > my last update on that branch was 2.5 years ago, so it's not coming
> > soon.
> >
> > I'd love to year how sparse checkout could be improved, or even
> > replaced. I think we still have to have some configuration rules, and
> > yes the flexibility of sparse checkout (or gitignore to be precise)
> > rules is a double-edged sword.
>
> Sorry for taking a while to respond, and if what I said came across
> harshly.  I agree that the flexibility of the rules makes it more
> complicated, though I think a bigger issue may be that the feature is
> hard to make smooth unless coupled to something like partial clones.

For something like partial clones, we would need something like
partial indexes. That is, the index does not record all paths in
worktree. The problem is at write-tree time, how to create trees with
such a partial index. So far the only option I see is record
directories in the index (in the same way we record submodule's
commit), which reduces the index and we are still able to create trees
from it.

> Work on that is ongoing.  Anyway, in an attempt to be helpful, here
> were some of the pain points I ran across:
>
> ..

Thanks. I think this basically boils down to no good UI (and user
facing command also gives us a man page to describe stuff instead of
hiding everything behind git-read-tree.txt). This is something I've
been meaning to do but never got around to (also a couple of sparse
checkout optimizations). There's also "git sparse-checkout" [1] that
would be a good starting point for this, but I don't think the author
got to the point to submit it to git.

[1] https://github.com/kjp/git-tools/blob/master/git-sparse-checkout
-- 
Duy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch
  2018-08-16 18:27 ` [PATCH v3] " Ben Peart
  2018-08-16 18:37   ` Duy Nguyen
@ 2018-09-18  5:34   ` Ævar Arnfjörð Bjarmason
  2018-09-18 16:57     ` Taylor Blau
  2018-09-18 17:13     ` Jeff King
  1 sibling, 2 replies; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-18  5:34 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ben Peart, Eric Sunshine,
	Nguyễn Thái Ngọc Duy,
	Ævar Arnfjörð Bjarmason

The documentation added in fa655d8411 ("checkout: optimize "git
checkout -b <new_branch>"", 2018-08-16) didn't add the double-colon
needed for the labeled list separator, as a result the added
documentation all got squashed into one paragraph. Fix that by adding
the list separator.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

Junio: Feel free to squash this in, but per your recent E-Mail it
doesn't seem you're planning to rewind "next", so this can go on top
of gitster/bp/checkout-new-branch-optim.

 Documentation/config.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ac71ade256..1546833213 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1153,7 +1153,7 @@ and by linkgit:git-worktree[1] when 'git worktree add' refers to a
 remote branch. This setting might be used for other checkout-like
 commands or functionality in the future.
 
-checkout.optimizeNewBranch
+checkout.optimizeNewBranch::
 	Optimizes the performance of "git checkout -b <new_branch>" when
 	using sparse-checkout.  When set to true, git will not update the
 	repo based on the current sparse-checkout settings.  This means it
-- 
2.19.0.rc2.392.g5ba43deb5a


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch
  2018-09-18  5:34   ` [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch Ævar Arnfjörð Bjarmason
@ 2018-09-18 16:57     ` Taylor Blau
  2018-09-18 17:16       ` Jeff King
  2018-09-18 17:13     ` Jeff King
  1 sibling, 1 reply; 32+ messages in thread
From: Taylor Blau @ 2018-09-18 16:57 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Ben Peart, Eric Sunshine,
	Nguyễn Thái Ngọc Duy

Hi Ævar,

On Tue, Sep 18, 2018 at 05:34:49AM +0000, Ævar Arnfjörð Bjarmason wrote:
> The documentation added in fa655d8411 ("checkout: optimize "git
> checkout -b <new_branch>"", 2018-08-16) didn't add the double-colon
> needed for the labeled list separator, as a result the added
> documentation all got squashed into one paragraph. Fix that by adding
> the list separator.

Looks good. Here's my:

  Signed-off-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch
  2018-09-18  5:34   ` [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch Ævar Arnfjörð Bjarmason
  2018-09-18 16:57     ` Taylor Blau
@ 2018-09-18 17:13     ` Jeff King
  2018-09-19  4:41       ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 32+ messages in thread
From: Jeff King @ 2018-09-18 17:13 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Ben Peart, Eric Sunshine,
	Nguyễn Thái Ngọc Duy

On Tue, Sep 18, 2018 at 05:34:49AM +0000, Ævar Arnfjörð Bjarmason wrote:

> The documentation added in fa655d8411 ("checkout: optimize "git
> checkout -b <new_branch>"", 2018-08-16) didn't add the double-colon
> needed for the labeled list separator, as a result the added
> documentation all got squashed into one paragraph. Fix that by adding
> the list separator.

Obviously the right thing to do, but your patch does not seem to apply
on that commit. Looks like you built it on a more recent commit that
also has checkout.defaultRemote (i.e., probably 'next')?

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index ac71ade256..1546833213 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -1153,7 +1153,7 @@ and by linkgit:git-worktree[1] when 'git worktree add' refers to a
>  remote branch. This setting might be used for other checkout-like
>  commands or functionality in the future.
>  
> -checkout.optimizeNewBranch
> +checkout.optimizeNewBranch::
>  	Optimizes the performance of "git checkout -b <new_branch>" when
>  	using sparse-checkout.  When set to true, git will not update the
>  	repo based on the current sparse-checkout settings.  This means it

Now the real reason I responded. :)

I just want to plug the recently-added "doc-diff" script, which diffs a
rendered view and makes it easier to catch cases like this. E.g., after
applying your patch:

  $ cd Documentation
  $ ./doc-diff HEAD^ HEAD
  [...]
  diff --git a/6ad88fd5d99b8dcbb69c6f3239ad41e66516d8f7/none/share/man/man1/git-config.1 b/28e258682021a09cd7842e4edbb21ad231c8bb7a/none/share/man/man1/git-config.1
  index 7298292112..fa91b5f2d1 100644
  --- a/6ad88fd5d99b8dcbb69c6f3239ad41e66516d8f7/none/share/man/man1/git-config.1
  +++ b/28e258682021a09cd7842e4edbb21ad231c8bb7a/none/share/man/man1/git-config.1
  @@ -1533,12 +1533,13 @@ CONFIGURATION FILE
              branch. This setting might be used for other checkout-like commands
              or functionality in the future.
   
  -       checkout.optimizeNewBranch Optimizes the performance of "git checkout
  -       -b <new_branch>" when using sparse-checkout. When set to true, git will
  -       not update the repo based on the current sparse-checkout settings. This
  -       means it will not update the skip-worktree bit in the index nor
  -       add/remove files in the working directory to reflect the current sparse
  -       checkout settings nor will it show the local changes.
  +       checkout.optimizeNewBranch
  +           Optimizes the performance of "git checkout -b <new_branch>" when
  +           using sparse-checkout. When set to true, git will not update the
  +           repo based on the current sparse-checkout settings. This means it
  +           will not update the skip-worktree bit in the index nor add/remove
  +           files in the working directory to reflect the current sparse
  +           checkout settings nor will it show the local changes.
   
          clean.requireForce
              A boolean to make git-clean do nothing unless given -f, -i or -n.

Obviously we knew this fix was going to help, but running it on the
original fa655d8411 (versus its parent) and eyeballing the result might
have caught this sooner.

(Though I would not be at all surprised to hear that the script does not
run on Windows, as it relies on things like "man"; patches certainly
welcome).

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch
  2018-09-18 16:57     ` Taylor Blau
@ 2018-09-18 17:16       ` Jeff King
  2018-09-18 17:20         ` Taylor Blau
  0 siblings, 1 reply; 32+ messages in thread
From: Jeff King @ 2018-09-18 17:16 UTC (permalink / raw)
  To: Taylor Blau
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Ben Peart, Eric Sunshine, Nguyễn Thái Ngọc Duy

On Tue, Sep 18, 2018 at 12:57:07PM -0400, Taylor Blau wrote:

> Hi Ævar,
> 
> On Tue, Sep 18, 2018 at 05:34:49AM +0000, Ævar Arnfjörð Bjarmason wrote:
> > The documentation added in fa655d8411 ("checkout: optimize "git
> > checkout -b <new_branch>"", 2018-08-16) didn't add the double-colon
> > needed for the labeled list separator, as a result the added
> > documentation all got squashed into one paragraph. Fix that by adding
> > the list separator.
> 
> Looks good. Here's my:
> 
>   Signed-off-by: Taylor Blau <me@ttaylorr.com>

I'm confused here. The signoff is really about agreeing publicly to the
DCO, and providing a chain of custody for the changes. So sometimes a
signoff from somebody besides the patch author is good if they
contributed content to the patch, but I don't see that here (or in any
nearby thread).

Did you mean "Reviewed-by:" ?

-Peff

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch
  2018-09-18 17:16       ` Jeff King
@ 2018-09-18 17:20         ` Taylor Blau
  0 siblings, 0 replies; 32+ messages in thread
From: Taylor Blau @ 2018-09-18 17:20 UTC (permalink / raw)
  To: Jeff King
  Cc: Taylor Blau, Ævar Arnfjörð Bjarmason, git,
	Junio C Hamano, Ben Peart, Eric Sunshine,
	Nguyễn Thái Ngọc Duy

On Tue, Sep 18, 2018 at 01:16:43PM -0400, Jeff King wrote:
> On Tue, Sep 18, 2018 at 12:57:07PM -0400, Taylor Blau wrote:
>
> > Hi Ævar,
> >
> > On Tue, Sep 18, 2018 at 05:34:49AM +0000, Ævar Arnfjörð Bjarmason wrote:
> > > The documentation added in fa655d8411 ("checkout: optimize "git
> > > checkout -b <new_branch>"", 2018-08-16) didn't add the double-colon
> > > needed for the labeled list separator, as a result the added
> > > documentation all got squashed into one paragraph. Fix that by adding
> > > the list separator.
> >
> > Looks good. Here's my:
> >
> >   Signed-off-by: Taylor Blau <me@ttaylorr.com>
>
> I'm confused here. The signoff is really about agreeing publicly to the
> DCO, and providing a chain of custody for the changes. So sometimes a
> signoff from somebody besides the patch author is good if they
> contributed content to the patch, but I don't see that here (or in any
> nearby thread).
>
> Did you mean "Reviewed-by:" ?

Indeed, I meant "Reviewed-by" instead of "Signed-off-by". I grok the
difference between the two, but without thinking about it I typed one
instead of the other.

Instead, here's my Reviewed-by:

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch
  2018-09-18 17:13     ` Jeff King
@ 2018-09-19  4:41       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 32+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-09-19  4:41 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Junio C Hamano, Ben Peart, Eric Sunshine,
	Nguyễn Thái Ngọc Duy


On Tue, Sep 18 2018, Jeff King wrote:

> On Tue, Sep 18, 2018 at 05:34:49AM +0000, Ævar Arnfjörð Bjarmason wrote:
>
>> The documentation added in fa655d8411 ("checkout: optimize "git
>> checkout -b <new_branch>"", 2018-08-16) didn't add the double-colon
>> needed for the labeled list separator, as a result the added
>> documentation all got squashed into one paragraph. Fix that by adding
>> the list separator.
>
> Obviously the right thing to do, but your patch does not seem to apply
> on that commit. Looks like you built it on a more recent commit that
> also has checkout.defaultRemote (i.e., probably 'next')?

Yeah, it's based on top of next, which I was testing at the time and
didn't expect that conflict.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-09-19  4:41 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-24 18:01 [PATCH v1] checkout: optionally speed up "git checkout -b foo" Ben Peart
2018-07-24 18:42 ` Eric Sunshine
2018-07-24 19:45   ` Ben Peart
2018-07-26 15:04     ` Junio C Hamano
2018-07-26 18:59       ` Eric Sunshine
2018-07-26 19:08         ` Eric Sunshine
2018-07-24 19:21 ` Junio C Hamano
2018-07-24 20:47   ` Ben Peart
2018-07-31 16:39 ` [PATCH v2] checkout: optimize "git checkout -b <new_branch>" Ben Peart
2018-07-31 20:01   ` Junio C Hamano
2018-08-01 15:10   ` Duy Nguyen
2018-08-02 18:02     ` Ben Peart
2018-08-03 15:58       ` Duy Nguyen
2018-08-06 14:25         ` Ben Peart
2018-08-15 21:05           ` Ben Peart
2018-08-05  8:57       ` Duy Nguyen
2018-08-16 18:27 ` [PATCH v3] " Ben Peart
2018-08-16 18:37   ` Duy Nguyen
2018-08-17 12:37     ` Ben Peart
2018-08-19  1:44       ` Elijah Newren
2018-08-20 13:40         ` Ben Peart
2018-08-20 18:16           ` Elijah Newren
2018-08-21 14:51             ` Duy Nguyen
2018-08-30 17:22               ` Elijah Newren
2018-09-04 16:46                 ` Duy Nguyen
2018-08-20 18:31         ` Junio C Hamano
2018-09-18  5:34   ` [PATCH] config doc: add missing list separator for checkout.optimizeNewBranch Ævar Arnfjörð Bjarmason
2018-09-18 16:57     ` Taylor Blau
2018-09-18 17:16       ` Jeff King
2018-09-18 17:20         ` Taylor Blau
2018-09-18 17:13     ` Jeff King
2018-09-19  4:41       ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).