git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH v1] convert: add test to demonstrate clean invocations
@ 2016-07-21 20:59 larsxschneider
  2016-07-21 21:37 ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: larsxschneider @ 2016-07-21 20:59 UTC (permalink / raw)
  To: git; +Cc: peff, Lars Schneider

From: Lars Schneider <larsxschneider@gmail.com>

If three files processed by a filter are added and committed to a
repository then I expect three clean invocations per Git operation.
Apparently Git invokes the clean process 12 times.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
---

It's pretty late here and I hope do not miss something obvious... but
can anyone explain to me why the clean operation is executed 12 times
for 3 files?

Thanks,
Lars


 t/t0021-conversion.sh | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..ab3d299 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -12,6 +12,12 @@ tr \
 EOF
 chmod +x rot13.sh

+cat <<EOF >run.sh
+#!$SHELL_PATH
+	echo "." >> run.count
+EOF
+chmod +x run.sh
+
 test_expect_success setup '
 	git config filter.rot13.smudge ./rot13.sh &&
 	git config filter.rot13.clean ./rot13.sh &&
@@ -268,4 +274,28 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '

+test_expect_failure 'required clean filter count' '
+	test_config_global filter.run_count.clean ./../run.sh &&
+	test_config_global filter.run_count.required true &&
+	rm -rf repo &&
+	mkdir repo &&
+	(
+		cd repo &&
+		git init &&
+
+		echo "*.r filter=run_count" >.gitattributes &&
+
+		cat ../test.o >test.r &&
+		echo "test22" >test2.r &&
+		echo "test333" >test3.r &&
+
+		rm -f run.count &&
+		git add . &&
+		test_line_count = 3 run.count
+
+		rm -f run.count &&
+		git commit . -m "test commit" &&
+		test_line_count = 3 run.count
+	)
+'
 test_done
--
2.5.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v1] convert: add test to demonstrate clean invocations
  2016-07-21 20:59 [PATCH v1] convert: add test to demonstrate clean invocations larsxschneider
@ 2016-07-21 21:37 ` Jeff King
  2016-07-22 15:27   ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2016-07-21 21:37 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

On Thu, Jul 21, 2016 at 10:59:07PM +0200, larsxschneider@gmail.com wrote:

> From: Lars Schneider <larsxschneider@gmail.com>
> 
> If three files processed by a filter are added and committed to a
> repository then I expect three clean invocations per Git operation.
> Apparently Git invokes the clean process 12 times.
> 
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> 
> It's pretty late here and I hope do not miss something obvious... but
> can anyone explain to me why the clean operation is executed 12 times
> for 3 files?

Hmm. If I run this under gdb and backtrace when we hit apply_filter(), I
see:

> +		rm -f run.count &&
> +		git add . &&
> +		test_line_count = 3 run.count

This "git add" does invoke apply_filter() 3 times, which I'd expect.

> +		rm -f run.count &&
> +		git commit . -m "test commit" &&
> +		test_line_count = 3 run.count

This invokes apply_filter() six times. So something funny is going on
already (I do get 12 dots, so checking apply_filter() seems to only
cover half the invocations).

Three of them are for re-adding the three files to the index again,
since "git commit ." is asking us to do so. I'm surprised, though; I
would have thought we could avoid doing so just based on the stat
information. Maybe it's a racy-index thing because the files' mtimes are
likely to be the same as the index?

Indeed, if I stick a "sleep 1" between modifying the files and calling
"git add", then the "git commit" invocation drops to only 6 invocations
of the filter. So that explains half of it (though I'm still confused
why index refreshing requires 6 and not 3; I guess it may be because
"git commit ." works in a separate index, and we first refresh that
index, and then refresh the "real" index again afterwards, when we could
theoretically just copy the entries).

The next three are to show the final diffstat after the commit
completes. It looks like the "should we reuse the worktree file?"
optimization in diff_populate_filespec() does not take into account
whether we will have to convert the contents, and it probably should.
The point is that sometimes mmap-ing the file contents is more efficient
than zlib inflating the object from disk. But if we have to exec an
extra process and read the whole object contents into a strbuf, that is
probably not a win.

Adding a "return 0" at the top of reuse_worktree_file() drops our 6 to
3 (but obviously it should actually check the attributes).

So of the 12 invocations:

  - 3 must be for refreshing the index again, because the way the test
    is written causes us to err on the side of caution when the mtimes
    are the same (and also means that even if your test is fixed to
    pass, it would racily fail when the system is under load)

  - 3 are for the double-refresh when "git commit ---only" is used (and
    "git commit ." implies "--only". Seems like there is room for
    optimization there.

  - 3 are for the tree-diff reusing the worktree files. This should be
    dropped.

  - The other 3, I'm not sure.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] diff: do not reuse worktree files that need "clean" conversion
  2016-07-21 21:37 ` Jeff King
@ 2016-07-22 15:27   ` Jeff King
  2016-07-22 17:44     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2016-07-22 15:27 UTC (permalink / raw)
  To: larsxschneider; +Cc: git

On Thu, Jul 21, 2016 at 05:37:40PM -0400, Jeff King wrote:

> The next three are to show the final diffstat after the commit
> completes. It looks like the "should we reuse the worktree file?"
> optimization in diff_populate_filespec() does not take into account
> whether we will have to convert the contents, and it probably should.
> The point is that sometimes mmap-ing the file contents is more efficient
> than zlib inflating the object from disk. But if we have to exec an
> extra process and read the whole object contents into a strbuf, that is
> probably not a win.
> 
> Adding a "return 0" at the top of reuse_worktree_file() drops our 6 to
> 3 (but obviously it should actually check the attributes).

Here's a patch for that. I put it in t0021, as well, since I couldn't
find a more appropriate place. This means it conflicts with your earlier
patch, but I think it does a better job of addressing the specific area
it fixes.

We may want further tests on top as we fix more areas (though as I said
earlier, I think by avoiding the racy timestamps, your "commit" call
drops to just a single invocation per file. That still seems like one
more than we should need, but it at least matches your original
expectation :) ).

-- >8 --
Subject: diff: do not reuse worktree files that need "clean" conversion

When accessing a blob for a diff, we may try to reuse file
contents in the working tree, under the theory that it is
faster to mmap those file contents than it would be to
extract the content from the object database.

When we have to filter those contents, though, that
assumption does not hold. Even for our internal conversions
like CRLF, we have to allocate and fill a new buffer anyway.
But much worse, for external clean filters we have to exec
an arbitrary script, and we have no idea how expensive it
may be to run.

So let's skip this optimization when conversion into git's
"clean" form is required. This applies whenever the
"want_file" flag is false. When it's true, the caller
actually wants the smudged worktree contents, which the
reused file by definition already has (in fact, this is a
key optimization going the other direction, since reusing
the worktree file there lets us skip smudge filters).

Signed-off-by: Jeff King <peff@peff.net>
---
 diff.c                |  7 +++++++
 t/t0021-conversion.sh | 11 +++++++++++
 2 files changed, 18 insertions(+)

diff --git a/diff.c b/diff.c
index 7d03419..b43d3dd 100644
--- a/diff.c
+++ b/diff.c
@@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
 	if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1))
 		return 0;
 
+	/*
+	 * Similarly, if we'd have to convert the file contents anyway, that
+	 * makes the optimization not worthwhile.
+	 */
+	if (!want_file && would_convert_to_git(name))
+		return 0;
+
 	len = strlen(name);
 	pos = cache_name_pos(name, len);
 	if (pos < 0)
diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
index 7bac2bc..e799e59 100755
--- a/t/t0021-conversion.sh
+++ b/t/t0021-conversion.sh
@@ -268,4 +268,15 @@ test_expect_success 'disable filter with empty override' '
 	test_must_be_empty err
 '
 
+test_expect_success 'diff does not reuse worktree files that need cleaning' '
+	test_config filter.counter.clean "echo . >>count; sed s/^/clean:/" &&
+	echo "file filter=counter" >.gitattributes &&
+	test_commit one file &&
+	test_commit two file &&
+
+	>count &&
+	git diff-tree -p HEAD &&
+	test_line_count = 0 count
+'
+
 test_done
-- 
2.9.2.506.g8452fe7


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion
  2016-07-22 15:27   ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King
@ 2016-07-22 17:44     ` Junio C Hamano
  2016-07-22 18:10       ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2016-07-22 17:44 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git

Jeff King <peff@peff.net> writes:

> Subject: diff: do not reuse worktree files that need "clean" conversion
>
> When accessing a blob for a diff, we may try to reuse file
> contents in the working tree, under the theory that it is
> faster to mmap those file contents than it would be to
> extract the content from the object database.
>
> When we have to filter those contents, though, that
> assumption does not hold. Even for our internal conversions
> like CRLF, we have to allocate and fill a new buffer anyway.
> But much worse, for external clean filters we have to exec
> an arbitrary script, and we have no idea how expensive it
> may be to run.
>
> So let's skip this optimization when conversion into git's
> "clean" form is required. This applies whenever the
> "want_file" flag is false. When it's true, the caller
> actually wants the smudged worktree contents, which the
> reused file by definition already has (in fact, this is a
> key optimization going the other direction, since reusing
> the worktree file there lets us skip smudge filters).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  diff.c                |  7 +++++++
>  t/t0021-conversion.sh | 11 +++++++++++
>  2 files changed, 18 insertions(+)
>
> diff --git a/diff.c b/diff.c
> index 7d03419..b43d3dd 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
>  	if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1))
>  		return 0;
>  
> +	/*
> +	 * Similarly, if we'd have to convert the file contents anyway, that
> +	 * makes the optimization not worthwhile.
> +	 */
> +	if (!want_file && would_convert_to_git(name))
> +		return 0;

The would_convert_to_git() function is not a free operation.  It
needs to prime the attribute stack, so it needs to open/read/parse a
few files ($GIT_DIR/info/attributes and .gitattributes at least, and
more if your directory hierarchy is deep) on the filesystem.  The
cost is amortized across paths, but we do not even enable the
optimization if we have to pay the cost of reading the index
ourselves.

I suspect that we may be better off disabling this optimization if
we need to always call the helper.

>  	len = strlen(name);
>  	pos = cache_name_pos(name, len);
>  	if (pos < 0)
> diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh
> index 7bac2bc..e799e59 100755
> --- a/t/t0021-conversion.sh
> +++ b/t/t0021-conversion.sh
> @@ -268,4 +268,15 @@ test_expect_success 'disable filter with empty override' '
>  	test_must_be_empty err
>  '
>  
> +test_expect_success 'diff does not reuse worktree files that need cleaning' '
> +	test_config filter.counter.clean "echo . >>count; sed s/^/clean:/" &&
> +	echo "file filter=counter" >.gitattributes &&
> +	test_commit one file &&
> +	test_commit two file &&
> +
> +	>count &&
> +	git diff-tree -p HEAD &&
> +	test_line_count = 0 count
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion
  2016-07-22 17:44     ` Junio C Hamano
@ 2016-07-22 18:10       ` Jeff King
  2016-07-22 19:30         ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2016-07-22 18:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: larsxschneider, git

On Fri, Jul 22, 2016 at 10:44:08AM -0700, Junio C Hamano wrote:

> > diff --git a/diff.c b/diff.c
> > index 7d03419..b43d3dd 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
> >  	if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1))
> >  		return 0;
> >  
> > +	/*
> > +	 * Similarly, if we'd have to convert the file contents anyway, that
> > +	 * makes the optimization not worthwhile.
> > +	 */
> > +	if (!want_file && would_convert_to_git(name))
> > +		return 0;
> 
> The would_convert_to_git() function is not a free operation.  It
> needs to prime the attribute stack, so it needs to open/read/parse a
> few files ($GIT_DIR/info/attributes and .gitattributes at least, and
> more if your directory hierarchy is deep) on the filesystem.  The
> cost is amortized across paths, but we do not even enable the
> optimization if we have to pay the cost of reading the index
> ourselves.

Yeah, I almost commented on that, and its position in the function, but
forgot to.

The only code path which will trigger this is diff_populate_filespec.
After reuse_worktree_file() says "yes, we can reuse", we drop into a
conditional that will end in us calling convert_to_git() anyway, which
will do the same lookup. We don't need to cache, because the expensive
parts of the attribute-lookup are already cached for us by the attribute
code.

So my initial thought was to put it at the end of reuse_worktree_file(),
where it would have the least impact. If we find we cannot reuse the
file, then we would skip both this new attr lookup and the one in
diff_populate_filespec().

But in practice, I think we'll already have cached those attrs before we
even hit this function, because we'll hit the userdiff_find_by_path()
code earlier in the diff process (e.g., to see if it's binary, if we
need to textconv, etc). Those look for different attributes, but I think
the expensive bits (finding, opening, reading attribute files) are
cached across all lookups.

So I think we actually _can_ think of would_convert_to_git() as
basically free. Or as free as other efficient-lookup functions we call
like cache_name_pos(). And so I moved it further up in the function,
where it lets us avoid doing more out-of-process work (like calling
lstat() so we can ce_match_stat() on the result).

Possibly it should go after the cache_name_pos() call, though. That's
likely to be less expensive than the actual walk of the attr tree.

> I suspect that we may be better off disabling this optimization if
> we need to always call the helper.

The thought "does this tree reuse even speed things up enough to justify
its complexity" definitely crossed my mind. It's basically swapping
open/mmap for zlib inflating the content.

But I do think it helps in the "want_file" case (i.e., when we are
writing out a tempfile for an external command via prepare_temp_file()).
There it helps us omit writing a tempfile to disk entirely, including
any possible conversion.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion
  2016-07-22 18:10       ` Jeff King
@ 2016-07-22 19:30         ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2016-07-22 19:30 UTC (permalink / raw)
  To: Jeff King; +Cc: larsxschneider, git

Jeff King <peff@peff.net> writes:

> So I think we actually _can_ think of would_convert_to_git() as
> basically free. Or as free as other efficient-lookup functions we call
> like cache_name_pos().

OK.

> The thought "does this tree reuse even speed things up enough to justify
> its complexity" definitely crossed my mind. It's basically swapping
> open/mmap for zlib inflating the content.
>
> But I do think it helps in the "want_file" case (i.e., when we are
> writing out a tempfile for an external command via prepare_temp_file()).
> There it helps us omit writing a tempfile to disk entirely, including
> any possible conversion.

OK.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-07-22 19:30 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-21 20:59 [PATCH v1] convert: add test to demonstrate clean invocations larsxschneider
2016-07-21 21:37 ` Jeff King
2016-07-22 15:27   ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King
2016-07-22 17:44     ` Junio C Hamano
2016-07-22 18:10       ` Jeff King
2016-07-22 19:30         ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).