* [PATCH v1] convert: add test to demonstrate clean invocations @ 2016-07-21 20:59 larsxschneider 2016-07-21 21:37 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: larsxschneider @ 2016-07-21 20:59 UTC (permalink / raw) To: git; +Cc: peff, Lars Schneider From: Lars Schneider <larsxschneider@gmail.com> If three files processed by a filter are added and committed to a repository then I expect three clean invocations per Git operation. Apparently Git invokes the clean process 12 times. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> --- It's pretty late here and I hope do not miss something obvious... but can anyone explain to me why the clean operation is executed 12 times for 3 files? Thanks, Lars t/t0021-conversion.sh | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh index 7bac2bc..ab3d299 100755 --- a/t/t0021-conversion.sh +++ b/t/t0021-conversion.sh @@ -12,6 +12,12 @@ tr \ EOF chmod +x rot13.sh +cat <<EOF >run.sh +#!$SHELL_PATH + echo "." >> run.count +EOF +chmod +x run.sh + test_expect_success setup ' git config filter.rot13.smudge ./rot13.sh && git config filter.rot13.clean ./rot13.sh && @@ -268,4 +274,28 @@ test_expect_success 'disable filter with empty override' ' test_must_be_empty err ' +test_expect_failure 'required clean filter count' ' + test_config_global filter.run_count.clean ./../run.sh && + test_config_global filter.run_count.required true && + rm -rf repo && + mkdir repo && + ( + cd repo && + git init && + + echo "*.r filter=run_count" >.gitattributes && + + cat ../test.o >test.r && + echo "test22" >test2.r && + echo "test333" >test3.r && + + rm -f run.count && + git add . && + test_line_count = 3 run.count + + rm -f run.count && + git commit . -m "test commit" && + test_line_count = 3 run.count + ) +' test_done -- 2.5.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v1] convert: add test to demonstrate clean invocations 2016-07-21 20:59 [PATCH v1] convert: add test to demonstrate clean invocations larsxschneider @ 2016-07-21 21:37 ` Jeff King 2016-07-22 15:27 ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2016-07-21 21:37 UTC (permalink / raw) To: larsxschneider; +Cc: git On Thu, Jul 21, 2016 at 10:59:07PM +0200, larsxschneider@gmail.com wrote: > From: Lars Schneider <larsxschneider@gmail.com> > > If three files processed by a filter are added and committed to a > repository then I expect three clean invocations per Git operation. > Apparently Git invokes the clean process 12 times. > > Signed-off-by: Lars Schneider <larsxschneider@gmail.com> > --- > > It's pretty late here and I hope do not miss something obvious... but > can anyone explain to me why the clean operation is executed 12 times > for 3 files? Hmm. If I run this under gdb and backtrace when we hit apply_filter(), I see: > + rm -f run.count && > + git add . && > + test_line_count = 3 run.count This "git add" does invoke apply_filter() 3 times, which I'd expect. > + rm -f run.count && > + git commit . -m "test commit" && > + test_line_count = 3 run.count This invokes apply_filter() six times. So something funny is going on already (I do get 12 dots, so checking apply_filter() seems to only cover half the invocations). Three of them are for re-adding the three files to the index again, since "git commit ." is asking us to do so. I'm surprised, though; I would have thought we could avoid doing so just based on the stat information. Maybe it's a racy-index thing because the files' mtimes are likely to be the same as the index? Indeed, if I stick a "sleep 1" between modifying the files and calling "git add", then the "git commit" invocation drops to only 6 invocations of the filter. So that explains half of it (though I'm still confused why index refreshing requires 6 and not 3; I guess it may be because "git commit ." works in a separate index, and we first refresh that index, and then refresh the "real" index again afterwards, when we could theoretically just copy the entries). The next three are to show the final diffstat after the commit completes. It looks like the "should we reuse the worktree file?" optimization in diff_populate_filespec() does not take into account whether we will have to convert the contents, and it probably should. The point is that sometimes mmap-ing the file contents is more efficient than zlib inflating the object from disk. But if we have to exec an extra process and read the whole object contents into a strbuf, that is probably not a win. Adding a "return 0" at the top of reuse_worktree_file() drops our 6 to 3 (but obviously it should actually check the attributes). So of the 12 invocations: - 3 must be for refreshing the index again, because the way the test is written causes us to err on the side of caution when the mtimes are the same (and also means that even if your test is fixed to pass, it would racily fail when the system is under load) - 3 are for the double-refresh when "git commit ---only" is used (and "git commit ." implies "--only". Seems like there is room for optimization there. - 3 are for the tree-diff reusing the worktree files. This should be dropped. - The other 3, I'm not sure. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH] diff: do not reuse worktree files that need "clean" conversion 2016-07-21 21:37 ` Jeff King @ 2016-07-22 15:27 ` Jeff King 2016-07-22 17:44 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2016-07-22 15:27 UTC (permalink / raw) To: larsxschneider; +Cc: git On Thu, Jul 21, 2016 at 05:37:40PM -0400, Jeff King wrote: > The next three are to show the final diffstat after the commit > completes. It looks like the "should we reuse the worktree file?" > optimization in diff_populate_filespec() does not take into account > whether we will have to convert the contents, and it probably should. > The point is that sometimes mmap-ing the file contents is more efficient > than zlib inflating the object from disk. But if we have to exec an > extra process and read the whole object contents into a strbuf, that is > probably not a win. > > Adding a "return 0" at the top of reuse_worktree_file() drops our 6 to > 3 (but obviously it should actually check the attributes). Here's a patch for that. I put it in t0021, as well, since I couldn't find a more appropriate place. This means it conflicts with your earlier patch, but I think it does a better job of addressing the specific area it fixes. We may want further tests on top as we fix more areas (though as I said earlier, I think by avoiding the racy timestamps, your "commit" call drops to just a single invocation per file. That still seems like one more than we should need, but it at least matches your original expectation :) ). -- >8 -- Subject: diff: do not reuse worktree files that need "clean" conversion When accessing a blob for a diff, we may try to reuse file contents in the working tree, under the theory that it is faster to mmap those file contents than it would be to extract the content from the object database. When we have to filter those contents, though, that assumption does not hold. Even for our internal conversions like CRLF, we have to allocate and fill a new buffer anyway. But much worse, for external clean filters we have to exec an arbitrary script, and we have no idea how expensive it may be to run. So let's skip this optimization when conversion into git's "clean" form is required. This applies whenever the "want_file" flag is false. When it's true, the caller actually wants the smudged worktree contents, which the reused file by definition already has (in fact, this is a key optimization going the other direction, since reusing the worktree file there lets us skip smudge filters). Signed-off-by: Jeff King <peff@peff.net> --- diff.c | 7 +++++++ t/t0021-conversion.sh | 11 +++++++++++ 2 files changed, 18 insertions(+) diff --git a/diff.c b/diff.c index 7d03419..b43d3dd 100644 --- a/diff.c +++ b/diff.c @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1)) return 0; + /* + * Similarly, if we'd have to convert the file contents anyway, that + * makes the optimization not worthwhile. + */ + if (!want_file && would_convert_to_git(name)) + return 0; + len = strlen(name); pos = cache_name_pos(name, len); if (pos < 0) diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh index 7bac2bc..e799e59 100755 --- a/t/t0021-conversion.sh +++ b/t/t0021-conversion.sh @@ -268,4 +268,15 @@ test_expect_success 'disable filter with empty override' ' test_must_be_empty err ' +test_expect_success 'diff does not reuse worktree files that need cleaning' ' + test_config filter.counter.clean "echo . >>count; sed s/^/clean:/" && + echo "file filter=counter" >.gitattributes && + test_commit one file && + test_commit two file && + + >count && + git diff-tree -p HEAD && + test_line_count = 0 count +' + test_done -- 2.9.2.506.g8452fe7 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion 2016-07-22 15:27 ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King @ 2016-07-22 17:44 ` Junio C Hamano 2016-07-22 18:10 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Junio C Hamano @ 2016-07-22 17:44 UTC (permalink / raw) To: Jeff King; +Cc: larsxschneider, git Jeff King <peff@peff.net> writes: > Subject: diff: do not reuse worktree files that need "clean" conversion > > When accessing a blob for a diff, we may try to reuse file > contents in the working tree, under the theory that it is > faster to mmap those file contents than it would be to > extract the content from the object database. > > When we have to filter those contents, though, that > assumption does not hold. Even for our internal conversions > like CRLF, we have to allocate and fill a new buffer anyway. > But much worse, for external clean filters we have to exec > an arbitrary script, and we have no idea how expensive it > may be to run. > > So let's skip this optimization when conversion into git's > "clean" form is required. This applies whenever the > "want_file" flag is false. When it's true, the caller > actually wants the smudged worktree contents, which the > reused file by definition already has (in fact, this is a > key optimization going the other direction, since reusing > the worktree file there lets us skip smudge filters). > > Signed-off-by: Jeff King <peff@peff.net> > --- > diff.c | 7 +++++++ > t/t0021-conversion.sh | 11 +++++++++++ > 2 files changed, 18 insertions(+) > > diff --git a/diff.c b/diff.c > index 7d03419..b43d3dd 100644 > --- a/diff.c > +++ b/diff.c > @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int > if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1)) > return 0; > > + /* > + * Similarly, if we'd have to convert the file contents anyway, that > + * makes the optimization not worthwhile. > + */ > + if (!want_file && would_convert_to_git(name)) > + return 0; The would_convert_to_git() function is not a free operation. It needs to prime the attribute stack, so it needs to open/read/parse a few files ($GIT_DIR/info/attributes and .gitattributes at least, and more if your directory hierarchy is deep) on the filesystem. The cost is amortized across paths, but we do not even enable the optimization if we have to pay the cost of reading the index ourselves. I suspect that we may be better off disabling this optimization if we need to always call the helper. > len = strlen(name); > pos = cache_name_pos(name, len); > if (pos < 0) > diff --git a/t/t0021-conversion.sh b/t/t0021-conversion.sh > index 7bac2bc..e799e59 100755 > --- a/t/t0021-conversion.sh > +++ b/t/t0021-conversion.sh > @@ -268,4 +268,15 @@ test_expect_success 'disable filter with empty override' ' > test_must_be_empty err > ' > > +test_expect_success 'diff does not reuse worktree files that need cleaning' ' > + test_config filter.counter.clean "echo . >>count; sed s/^/clean:/" && > + echo "file filter=counter" >.gitattributes && > + test_commit one file && > + test_commit two file && > + > + >count && > + git diff-tree -p HEAD && > + test_line_count = 0 count > +' > + > test_done ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion 2016-07-22 17:44 ` Junio C Hamano @ 2016-07-22 18:10 ` Jeff King 2016-07-22 19:30 ` Junio C Hamano 0 siblings, 1 reply; 6+ messages in thread From: Jeff King @ 2016-07-22 18:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: larsxschneider, git On Fri, Jul 22, 2016 at 10:44:08AM -0700, Junio C Hamano wrote: > > diff --git a/diff.c b/diff.c > > index 7d03419..b43d3dd 100644 > > --- a/diff.c > > +++ b/diff.c > > @@ -2683,6 +2683,13 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int > > if (!FAST_WORKING_DIRECTORY && !want_file && has_sha1_pack(sha1)) > > return 0; > > > > + /* > > + * Similarly, if we'd have to convert the file contents anyway, that > > + * makes the optimization not worthwhile. > > + */ > > + if (!want_file && would_convert_to_git(name)) > > + return 0; > > The would_convert_to_git() function is not a free operation. It > needs to prime the attribute stack, so it needs to open/read/parse a > few files ($GIT_DIR/info/attributes and .gitattributes at least, and > more if your directory hierarchy is deep) on the filesystem. The > cost is amortized across paths, but we do not even enable the > optimization if we have to pay the cost of reading the index > ourselves. Yeah, I almost commented on that, and its position in the function, but forgot to. The only code path which will trigger this is diff_populate_filespec. After reuse_worktree_file() says "yes, we can reuse", we drop into a conditional that will end in us calling convert_to_git() anyway, which will do the same lookup. We don't need to cache, because the expensive parts of the attribute-lookup are already cached for us by the attribute code. So my initial thought was to put it at the end of reuse_worktree_file(), where it would have the least impact. If we find we cannot reuse the file, then we would skip both this new attr lookup and the one in diff_populate_filespec(). But in practice, I think we'll already have cached those attrs before we even hit this function, because we'll hit the userdiff_find_by_path() code earlier in the diff process (e.g., to see if it's binary, if we need to textconv, etc). Those look for different attributes, but I think the expensive bits (finding, opening, reading attribute files) are cached across all lookups. So I think we actually _can_ think of would_convert_to_git() as basically free. Or as free as other efficient-lookup functions we call like cache_name_pos(). And so I moved it further up in the function, where it lets us avoid doing more out-of-process work (like calling lstat() so we can ce_match_stat() on the result). Possibly it should go after the cache_name_pos() call, though. That's likely to be less expensive than the actual walk of the attr tree. > I suspect that we may be better off disabling this optimization if > we need to always call the helper. The thought "does this tree reuse even speed things up enough to justify its complexity" definitely crossed my mind. It's basically swapping open/mmap for zlib inflating the content. But I do think it helps in the "want_file" case (i.e., when we are writing out a tempfile for an external command via prepare_temp_file()). There it helps us omit writing a tempfile to disk entirely, including any possible conversion. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] diff: do not reuse worktree files that need "clean" conversion 2016-07-22 18:10 ` Jeff King @ 2016-07-22 19:30 ` Junio C Hamano 0 siblings, 0 replies; 6+ messages in thread From: Junio C Hamano @ 2016-07-22 19:30 UTC (permalink / raw) To: Jeff King; +Cc: larsxschneider, git Jeff King <peff@peff.net> writes: > So I think we actually _can_ think of would_convert_to_git() as > basically free. Or as free as other efficient-lookup functions we call > like cache_name_pos(). OK. > The thought "does this tree reuse even speed things up enough to justify > its complexity" definitely crossed my mind. It's basically swapping > open/mmap for zlib inflating the content. > > But I do think it helps in the "want_file" case (i.e., when we are > writing out a tempfile for an external command via prepare_temp_file()). > There it helps us omit writing a tempfile to disk entirely, including > any possible conversion. OK. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-07-22 19:30 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-07-21 20:59 [PATCH v1] convert: add test to demonstrate clean invocations larsxschneider 2016-07-21 21:37 ` Jeff King 2016-07-22 15:27 ` [PATCH] diff: do not reuse worktree files that need "clean" conversion Jeff King 2016-07-22 17:44 ` Junio C Hamano 2016-07-22 18:10 ` Jeff King 2016-07-22 19:30 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).