* git grep --threads 12 --textconv is effectively single-threaded @ 2020-07-07 21:25 Zach Riggle 2020-07-07 21:59 ` Jeff King 0 siblings, 1 reply; 7+ messages in thread From: Zach Riggle @ 2020-07-07 21:25 UTC (permalink / raw) To: git It looks like the bit of code that is responsible for performing textconv conversions is single-threaded, even if git-grep is provided a number of threads to use. This greatly limits the usability of textconv filters for grep expressions over a large number of files. Note the difference in total CPU usage in the following expressions: $ git grep --threads 12 -e foobar --and -e fizzbuzz &> /dev/null 0.24s user 0.28s system 710% cpu 0.073 total $ git grep --threads 12 -e foobar --and -e fizzbuzz --textconv &> /dev/null 0.90s user 1.75s system 110% cpu 2.390 total $ git version git version 2.27.0 Zach Riggle ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git grep --threads 12 --textconv is effectively single-threaded 2020-07-07 21:25 git grep --threads 12 --textconv is effectively single-threaded Zach Riggle @ 2020-07-07 21:59 ` Jeff King 2020-07-08 17:40 ` Zach Riggle 0 siblings, 1 reply; 7+ messages in thread From: Jeff King @ 2020-07-07 21:59 UTC (permalink / raw) To: Zach Riggle; +Cc: git On Tue, Jul 07, 2020 at 04:25:01PM -0500, Zach Riggle wrote: > It looks like the bit of code that is responsible for performing > textconv conversions is single-threaded, even if git-grep is provided > a number of threads to use. Yes, the locking is much coarser than it could be. The issue is in grep.c's fill_textconv_grep(): /* * fill_textconv is not remotely thread-safe; it modifies the global * diff tempfile structure, writes to the_repo's odb and might * internally call thread-unsafe functions such as the * prepare_packed_git() lazy-initializator. Because of the last two, we * must ensure mutual exclusion between this call and the object reading * API, thus we use obj_read_lock() here. * * TODO: allowing text conversion to run in parallel with object * reading operations might increase performance in the multithreaded * non-worktreee git-grep with --textconv. */ obj_read_lock(); size = fill_textconv(r, driver, df, &buf); obj_read_unlock(); free_filespec(df); Note that this lock is used whether we're doing textconv's or not (i.e., it also excludes reading two objects from the object database at the same time, because none of that code is thread-safe). But the latency when we're doing a textconv is _much_ higher, because it's shelling out to a separate process and reading/writing the contents. Note the much-higher system CPU in your second timing: > Note the difference in total CPU usage in the following expressions: > > $ git grep --threads 12 -e foobar --and -e fizzbuzz &> /dev/null > 0.24s user 0.28s system 710% cpu 0.073 total > > $ git grep --threads 12 -e foobar --and -e fizzbuzz --textconv &> /dev/null > 0.90s user 1.75s system 110% cpu 2.390 total So I think implementing that TODO would help a lot (because each textconv could in theory proceed in parallel). As workaround in the meantime, I suspect that enabling diff.<driver>.cachetextconv for your particular textconv config might help. It would be slow on the first run, but then we'd be able to skip the external process entirely for subsequent runs (the results are cached in a git-notes tree, which are just raw object reads). -Peff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git grep --threads 12 --textconv is effectively single-threaded 2020-07-07 21:59 ` Jeff King @ 2020-07-08 17:40 ` Zach Riggle 2020-07-08 20:13 ` Jeff King 0 siblings, 1 reply; 7+ messages in thread From: Zach Riggle @ 2020-07-08 17:40 UTC (permalink / raw) To: Jeff King; +Cc: git So I just need to git -c "diff.c.cachetextconv=true" ... And the cache should automagically work? Zach Riggle On Tue, Jul 7, 2020 at 4:59 PM Jeff King <peff@peff.net> wrote: > > On Tue, Jul 07, 2020 at 04:25:01PM -0500, Zach Riggle wrote: > > > It looks like the bit of code that is responsible for performing > > textconv conversions is single-threaded, even if git-grep is provided > > a number of threads to use. > > Yes, the locking is much coarser than it could be. The issue is in > grep.c's fill_textconv_grep(): > > /* > * fill_textconv is not remotely thread-safe; it modifies the global > * diff tempfile structure, writes to the_repo's odb and might > * internally call thread-unsafe functions such as the > * prepare_packed_git() lazy-initializator. Because of the last two, we > * must ensure mutual exclusion between this call and the object reading > * API, thus we use obj_read_lock() here. > * > * TODO: allowing text conversion to run in parallel with object > * reading operations might increase performance in the multithreaded > * non-worktreee git-grep with --textconv. > */ > obj_read_lock(); > size = fill_textconv(r, driver, df, &buf); > obj_read_unlock(); > free_filespec(df); > > Note that this lock is used whether we're doing textconv's or not (i.e., > it also excludes reading two objects from the object database at the > same time, because none of that code is thread-safe). But the latency > when we're doing a textconv is _much_ higher, because it's shelling out > to a separate process and reading/writing the contents. Note the > much-higher system CPU in your second timing: > > > Note the difference in total CPU usage in the following expressions: > > > > $ git grep --threads 12 -e foobar --and -e fizzbuzz &> /dev/null > > 0.24s user 0.28s system 710% cpu 0.073 total > > > > $ git grep --threads 12 -e foobar --and -e fizzbuzz --textconv &> /dev/null > > 0.90s user 1.75s system 110% cpu 2.390 total > > So I think implementing that TODO would help a lot (because each > textconv could in theory proceed in parallel). > > As workaround in the meantime, I suspect that enabling > diff.<driver>.cachetextconv for your particular textconv config might > help. It would be slow on the first run, but then we'd be able to skip > the external process entirely for subsequent runs (the results are > cached in a git-notes tree, which are just raw object reads). > > -Peff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git grep --threads 12 --textconv is effectively single-threaded 2020-07-08 17:40 ` Zach Riggle @ 2020-07-08 20:13 ` Jeff King 2020-07-08 21:06 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Jeff King @ 2020-07-08 20:13 UTC (permalink / raw) To: Zach Riggle; +Cc: git On Wed, Jul 08, 2020 at 12:40:49PM -0500, Zach Riggle wrote: > So I just need to > > git -c "diff.c.cachetextconv=true" ... > > And the cache should automagically work? Yes, but there's a bit of subtlety with what you're grepping. Try this example setup: git init -q repo cd repo echo content >file echo 'file diff=foo' >.gitattributes git add . git commit -m base git config diff.foo.textconv 'echo >&2 converting...; sleep 1; tr a-z A-Z <' which should make it clear when the filter actually runs. Now try this: echo >&2 "==> no textconv" git grep . file echo >&2 "==> with textconv" git grep --textconv . file echo >&2 "==> cached (one)" git -c diff.foo.cachetextconv=true grep --textconv . file echo >&2 "==> cached (two)" git -c diff.foo.cachetextconv=true grep --textconv . file We'd expect the final one to say "file:CONTENT" but _without_ a "converting" line. But it still runs the textconv script! The problem is that the cache uses the blob id of the source as its key. But since we're grepping files in the working directory, they don't have an object id at all. If we grep in the tree, it works as expected: $ git -c diff.foo.cachetextconv=true grep --textconv . HEAD -- file converting... HEAD:file:CONTENT $ git -c diff.foo.cachetextconv=true grep --textconv . HEAD -- file HEAD:file:CONTENT All of this textconv stuff was originally designed for the diff infrastructure. Even when it's diffing files, if they're tracked in the repository the diff code will pull the oid from the index (assuming the file is stat-clean). But the grep code doesn't do that (and I doubt that it matters for anything at all except this textconv caching feature). It's probably possible to teach the grep code to do the same check-in-the-index trick, but I'm not sure how complicated it would be. -Peff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git grep --threads 12 --textconv is effectively single-threaded 2020-07-08 20:13 ` Jeff King @ 2020-07-08 21:06 ` Junio C Hamano 2020-07-09 23:10 ` Jeff King 0 siblings, 1 reply; 7+ messages in thread From: Junio C Hamano @ 2020-07-08 21:06 UTC (permalink / raw) To: Jeff King; +Cc: Zach Riggle, git Jeff King <peff@peff.net> writes: > It's probably possible to teach the grep code to do the same > check-in-the-index trick, but I'm not sure how complicated it would be. I am not sure if we should even depend on the "check the object database and use it instead of reading the working tree files" done in diff code---somehow I thought we did the opposite for performance (i.e. when we ought to be comparing two objects, taken from tree and the index, if we notice that the index side is stat clean, we can read/mmap the working tree file instead of going to the object layer and deflating a loose object, or, worse yet, construct the blob by repeatedly applying deltas on a base object in a packfile). Is this one in the opposite direction done specifically for gaining performance when textconv cache is in use? If so, kudos to whoever did it---that sounds like a clever thing to do. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git grep --threads 12 --textconv is effectively single-threaded 2020-07-08 21:06 ` Junio C Hamano @ 2020-07-09 23:10 ` Jeff King 2020-07-10 16:43 ` Junio C Hamano 0 siblings, 1 reply; 7+ messages in thread From: Jeff King @ 2020-07-09 23:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Zach Riggle, git On Wed, Jul 08, 2020 at 02:06:31PM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > It's probably possible to teach the grep code to do the same > > check-in-the-index trick, but I'm not sure how complicated it would be. > > I am not sure if we should even depend on the "check the object > database and use it instead of reading the working tree files" done > in diff code---somehow I thought we did the opposite for performance > (i.e. when we ought to be comparing two objects, taken from tree and > the index, if we notice that the index side is stat clean, we can > read/mmap the working tree file instead of going to the object layer > and deflating a loose object, or, worse yet, construct the blob by > repeatedly applying deltas on a base object in a packfile). > > Is this one in the opposite direction done specifically for gaining > performance when textconv cache is in use? If so, kudos to whoever > did it---that sounds like a clever thing to do. No, it turns out that nobody was that clever (and I was simply misremembering how it worked). For a tree-to-tree or index-to-tree comparison, both sides will have an oid and can use the textconv cache. Even for an index case where we might choose to use a stat-fresh working tree file as an optimization, we'll still consult the textconv cache before loading those contents. But for diffing a file in the working tree, we'll never have an oid and will always run the textconv command). So "git diff" against the index, for example, would run _one_ textconv (using the cached value for the index, and running one for the working tree version). And we know that isn't that interesting for optimizing, since by definition the file is stat-dirty in that case (or else we'd skip the content-level comparison entirely). So you'd have to compute the sha1 of the working tree file from scratch. Plus the lifetime of a working tree file's entry in the textconv cache is probably smaller, since it hasn't even been committed yet. I don't think I ever noticed because the primary thing I was trying to speed up with the textconv cache is "git log -p", etc, which always has an oid to work with. But "grep" is a totally different story. It is frequently looking at all of the stat-fresh working tree files. -Peff ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git grep --threads 12 --textconv is effectively single-threaded 2020-07-09 23:10 ` Jeff King @ 2020-07-10 16:43 ` Junio C Hamano 0 siblings, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2020-07-10 16:43 UTC (permalink / raw) To: Jeff King; +Cc: Zach Riggle, git Jeff King <peff@peff.net> writes: > For a tree-to-tree or index-to-tree comparison, both sides will have an > oid and can use the textconv cache. Even for an index case where we > might choose to use a stat-fresh working tree file as an optimization, > we'll still consult the textconv cache before loading those contents. OK, that "we'll still consult" part does sound like sort-of "clever", but after thinking about the whole sentence twice, I realize the actual cleverness lies in the reuse of stat-fresh working tree file, not the use of textconv cache ;-) The cache is just doing its normal thing: if we know the oid, look up the cached one. > But for diffing a file in the working tree, we'll never have an oid and > will always run the textconv command). OK. In such a case, we need to run the clean filter on the working tree contents, and then finally we need to run the textconv on the result. We could internally hash the result of applying the clean filter to see if we have the blob in the object database and use it as the look-up key in the textconv cache, but we are talking about the working tree files, which by definition is more fluid than what is in the index, which is likely more fluid than what is already committed, so the chance of finding a hit may be slim. We could still see if the oid in the index is correct with the stat check, but by definition, diff-files won't compare between a cache entry and a working tree file if the path is stat-clean, so that does not help all that much. I wonder if diff-index comparing a tree and the working tree (i.e. without "--cached") can be improved, though. > So "git diff" against the index, > for example, would run _one_ textconv (using the cached value for the > index, and running one for the working tree version). And we know that > isn't that interesting for optimizing, since by definition the file is > stat-dirty in that case (or else we'd skip the content-level comparison > entirely). Yup, we reached the same conclusion here ;-) > So you'd have to compute the sha1 of the working tree file > from scratch. Plus the lifetime of a working tree file's entry in the > textconv cache is probably smaller, since it hasn't even been committed > yet. Yes again. > I don't think I ever noticed because the primary thing I was trying to > speed up with the textconv cache is "git log -p", etc, which always has > an oid to work with. Absolutely. > But "grep" is a totally different story. It is frequently looking at all > of the stat-fresh working tree files. Yeah. Grepping in the working tree files could be optimized with the same technique that would be used to optimize diff-index without "--cached". When we look at the working tree file, we consult the index and possibly learn its object name if the path is stat-clean, probably if and only if textconv is in use. "git grep" would divert to the "grep in a blob object" codepath from there, and "git diff-index" would make clever use of oid_valid bit in diff_filespec when running textconv (we might need a separate bit, though; I haven't thought it through). Or something like that. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2020-07-10 16:43 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-07 21:25 git grep --threads 12 --textconv is effectively single-threaded Zach Riggle 2020-07-07 21:59 ` Jeff King 2020-07-08 17:40 ` Zach Riggle 2020-07-08 20:13 ` Jeff King 2020-07-08 21:06 ` Junio C Hamano 2020-07-09 23:10 ` Jeff King 2020-07-10 16:43 ` Junio C Hamano
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).