git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] commit: Optimize number of lstat() calls
@ 2017-02-03 23:22 Gumbel, Matthew K
  2017-02-03 23:38 ` Junio C Hamano
  2017-02-04  6:50 ` Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Gumbel, Matthew K @ 2017-02-03 23:22 UTC (permalink / raw)
  To: git@vger.kernel.org

When making a --only commit, original behavior was to do a full cache
update for the purposes of giving the pre-commit hook an up-to-date set 
of stat data. That would result in long runtime for git-commit in a big 
repo on NFS (>60s for a 54k-file repo).

With this change, when doing a --only commit and no pre-commit hook is
present, the cache update is skipped since it is known a priori which
files are to be committed.

This was discussed on the mailing list here:
https://public-inbox.org/git/DA0A42D68346B1469147552440A645039A9C56D4@ORSMSX115.amr.corp.intel.com/

Signed-off-by: Matthew K. Gumbel <matthew.k.gumbel@intel.com>
---
builtin/commit.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index 2de5f6c..1df3d71 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -470,7 +470,9 @@ static const char *prepare_index(int argc, const char **argv, const char *prefix

    hold_locked_index(&index_lock, LOCK_DIE_ON_ERROR);
    add_remove_files(&partial);
-   refresh_cache(REFRESH_QUIET);
+    if (find_hook("pre-commit")) {
+        refresh_cache(REFRESH_QUIET);
+    }
    update_main_cache_tree(WRITE_TREE_SILENT);
    if (write_locked_index(&the_index, &index_lock, CLOSE_LOCK))
        die(_("unable to write new_index file"));
@@ -482,7 +484,9 @@ static const char *prepare_index(int argc, const char **argv, const char *prefix

    create_base_index(current_head);
    add_remove_files(&partial);
-   refresh_cache(REFRESH_QUIET);
+    if (find_hook("pre-commit")) {
+        refresh_cache(REFRESH_QUIET);
+    }

    if (write_locked_index(&the_index, &false_lock, CLOSE_LOCK))
        die(_("unable to write temporary index file"));
-- 
2.8.4



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit: Optimize number of lstat() calls
  2017-02-03 23:22 [PATCH] commit: Optimize number of lstat() calls Gumbel, Matthew K
@ 2017-02-03 23:38 ` Junio C Hamano
  2017-02-04  6:50 ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2017-02-03 23:38 UTC (permalink / raw)
  To: Gumbel, Matthew K; +Cc: git@vger.kernel.org

"Gumbel, Matthew K" <matthew.k.gumbel@intel.com> writes:

> When making a --only commit, original behavior was to do a full cache
> update for the purposes of giving the pre-commit hook an up-to-date set 
> of stat data. That would result in long runtime for git-commit in a big 
> repo on NFS (>60s for a 54k-file repo).
>
> With this change, when doing a --only commit and no pre-commit hook is
> present, the cache update is skipped since it is known a priori which
> files are to be committed.

Did you determine that "pre-commit" hook is the only thing that
needs the special case and if so how?  With that stated in the log
message, we would feel safe to take this patch.

> Signed-off-by: Matthew K. Gumbel <matthew.k.gumbel@intel.com>
> ---
> builtin/commit.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/commit.c b/builtin/commit.c
> index 2de5f6c..1df3d71 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -470,7 +470,9 @@ static const char *prepare_index(int argc, const char **argv, const char *prefix
>
>     hold_locked_index(&index_lock, LOCK_DIE_ON_ERROR);
>     add_remove_files(&partial);
> -   refresh_cache(REFRESH_QUIET);
> +    if (find_hook("pre-commit")) {
> +        refresh_cache(REFRESH_QUIET);
> +    }
>     update_main_cache_tree(WRITE_TREE_SILENT);
>     if (write_locked_index(&the_index, &index_lock, CLOSE_LOCK))
>         die(_("unable to write new_index file"));

I wonder why this patch is full of 0xa0 bytes instead of spaces.
Somebody between your editor and mailing list is munging your
patches?

> @@ -482,7 +484,9 @@ static const char *prepare_index(int argc, const char **argv, const char *prefix
>
>     create_base_index(current_head);
>     add_remove_files(&partial);
> -   refresh_cache(REFRESH_QUIET);
> +    if (find_hook("pre-commit")) {
> +        refresh_cache(REFRESH_QUIET);
> +    }
>
>     if (write_locked_index(&the_index, &false_lock, CLOSE_LOCK))
>         die(_("unable to write temporary index file"));

I'd prefer if you introduced a small helper function, name it
can_cheat_on_refreshing() or something, and encapsulate the call to
find_hook("pre-commit") inside it---if there are other corner cases
you didn't think of, or if in the future more hooks are added that
need their incoming index file already refreshed, we do not want to
update these two places---rather we'd want to limit the need for
updating to the single helper function.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit: Optimize number of lstat() calls
  2017-02-04  6:50 ` Junio C Hamano
@ 2017-02-04  0:35   ` Johannes Schindelin
  2017-02-04  7:23     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Schindelin @ 2017-02-04  0:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Gumbel, Matthew K, git@vger.kernel.org

Hi Junio,

On Fri, 3 Feb 2017, Junio C Hamano wrote:

> Aside from whitespace breakage, I think we should never skip the
> refreshing of the real index that is left after "git commit"
> finishes.
> 
> Between these two calls to refresh_cache(), the one that writes out
> a temporary index that contains the contents of HEAD plus the
> contents of the working tree for the specified paths may be fine
> without refreshing, unless somebody else (like the pre-commit hook)
> looks at it.  But the other one refreshes the real index file that
> will be used after "git commit" returns the control.  Users and
> scripts that run "git commit" inside expect that the entries in the
> resulting index are refreshed after "git commit" returns, and I do
> not think of a safe way to optimizing it out; unlike the other one,
> to which we can say "as long as there is no pre-commit hook, nobody
> will look at it after we are done", there does not an easy-to-check
> set of conditions that we can use to decide when it is safe to skip
> refreshing.
> 
> Besides, leaving the main index not refreshed would mean the user
> has to pay the refreshing cost when s/he runs other commands "git
> diff", "git status", etc. after "git commit" for the first time;
> so...

I am not sure... when you run `git diff` and `git status`, the index is
refreshed *anyway*, so with the patch under discussion we would save one
round of lstat() calls, for the same effect.

I could imagine that there is a third option we should consider, too: only
lstat() and update the paths that match the pathspec(s) provided on the
command line (this is the semantic meaning of the --only option, after
all: "I am only interested in these paths as far as this commit is
concerned"). What do you think?

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit: Optimize number of lstat() calls
  2017-02-03 23:22 [PATCH] commit: Optimize number of lstat() calls Gumbel, Matthew K
  2017-02-03 23:38 ` Junio C Hamano
@ 2017-02-04  6:50 ` Junio C Hamano
  2017-02-04  0:35   ` Johannes Schindelin
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-02-04  6:50 UTC (permalink / raw)
  To: Gumbel, Matthew K; +Cc: git@vger.kernel.org

Aside from whitespace breakage, I think we should never skip the
refreshing of the real index that is left after "git commit"
finishes.

Between these two calls to refresh_cache(), the one that writes out
a temporary index that contains the contents of HEAD plus the
contents of the working tree for the specified paths may be fine
without refreshing, unless somebody else (like the pre-commit hook)
looks at it.  But the other one refreshes the real index file that
will be used after "git commit" returns the control.  Users and
scripts that run "git commit" inside expect that the entries in the
resulting index are refreshed after "git commit" returns, and I do
not think of a safe way to optimizing it out; unlike the other one,
to which we can say "as long as there is no pre-commit hook, nobody
will look at it after we are done", there does not an easy-to-check
set of conditions that we can use to decide when it is safe to skip
refreshing.

Besides, leaving the main index not refreshed would mean the user
has to pay the refreshing cost when s/he runs other commands "git
diff", "git status", etc. after "git commit" for the first time;
so...


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit: Optimize number of lstat() calls
  2017-02-04  0:35   ` Johannes Schindelin
@ 2017-02-04  7:23     ` Junio C Hamano
  2017-02-04 10:42       ` Johannes Schindelin
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-02-04  7:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Gumbel, Matthew K, git@vger.kernel.org

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> Besides, leaving the main index not refreshed would mean the user
>> has to pay the refreshing cost when s/he runs other commands "git
>> diff", "git status", etc. after "git commit" for the first time;
>> so...
>
> I am not sure... when you run `git diff` and `git status`, the index is
> refreshed *anyway*, so with the patch under discussion we would save one
> round of lstat() calls, for the same effect.

Yeah, you're right.  The only ones that could be affected are
plumbing commands, and scripts that use plumbing are expected to be
written without relying on the "refreshed"-ness of the index they
are given (iow, if they want to rely on, they are expected to refresh
first before using plumbing commands).  So there is no downside of
leaving the index in an unrefreshed state as long as everbody plays
by the rule.

> I could imagine that there is a third option we should consider, too: only
> lstat() and update the paths that match the pathspec(s) provided on the
> command line (this is the semantic meaning of the --only option, after
> all: "I am only interested in these paths as far as this commit is
> concerned"). What do you think?

I wondered that myself when I read the first message from Matthew
and noticed that we always refresh the entire index.  But if it is
OK to leave the entire index un-refreshed, that would even be
simpler ;-)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] commit: Optimize number of lstat() calls
  2017-02-04  7:23     ` Junio C Hamano
@ 2017-02-04 10:42       ` Johannes Schindelin
  0 siblings, 0 replies; 6+ messages in thread
From: Johannes Schindelin @ 2017-02-04 10:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Gumbel, Matthew K, git@vger.kernel.org

Hi Junio,


On Fri, 3 Feb 2017, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > I could imagine that there is a third option we should consider, too: only
> > lstat() and update the paths that match the pathspec(s) provided on the
> > command line (this is the semantic meaning of the --only option, after
> > all: "I am only interested in these paths as far as this commit is
> > concerned"). What do you think?
> 
> I wondered that myself when I read the first message from Matthew
> and noticed that we always refresh the entire index.  But if it is
> OK to leave the entire index un-refreshed, that would even be
> simpler ;-)

Hah! You're right, that would be much simpler ;-)

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-02-04 10:43 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-03 23:22 [PATCH] commit: Optimize number of lstat() calls Gumbel, Matthew K
2017-02-03 23:38 ` Junio C Hamano
2017-02-04  6:50 ` Junio C Hamano
2017-02-04  0:35   ` Johannes Schindelin
2017-02-04  7:23     ` Junio C Hamano
2017-02-04 10:42       ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).