git grep --threads 12 --textconv is effectively single-threaded

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* git grep --threads 12 --textconv is effectively single-threaded
@ 2020-07-07 21:25 Zach Riggle
  2020-07-07 21:59 ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Zach Riggle @ 2020-07-07 21:25 UTC (permalink / raw)
  To: git

It looks like the bit of code that is responsible for performing
textconv conversions is single-threaded, even if git-grep is provided
a number of threads to use.

This greatly limits the usability of textconv filters for grep
expressions over a large number of files.

Note the difference in total CPU usage in the following expressions:

$ git grep --threads 12 -e foobar --and -e fizzbuzz &> /dev/null
0.24s user 0.28s system 710% cpu 0.073 total

$ git grep --threads 12 -e foobar --and -e fizzbuzz --textconv &> /dev/null
0.90s user 1.75s system 110% cpu 2.390 total

$ git version
git version 2.27.0

Zach Riggle

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep --threads 12 --textconv is effectively single-threaded
  2020-07-07 21:25 git grep --threads 12 --textconv is effectively single-threaded Zach Riggle
@ 2020-07-07 21:59 ` Jeff King
  2020-07-08 17:40   ` Zach Riggle
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2020-07-07 21:59 UTC (permalink / raw)
  To: Zach Riggle; +Cc: git

On Tue, Jul 07, 2020 at 04:25:01PM -0500, Zach Riggle wrote:

> It looks like the bit of code that is responsible for performing
> textconv conversions is single-threaded, even if git-grep is provided
> a number of threads to use.

Yes, the locking is much coarser than it could be. The issue is in
grep.c's fill_textconv_grep():

          /*
           * fill_textconv is not remotely thread-safe; it modifies the global
           * diff tempfile structure, writes to the_repo's odb and might
           * internally call thread-unsafe functions such as the
           * prepare_packed_git() lazy-initializator. Because of the last two, we
           * must ensure mutual exclusion between this call and the object reading
           * API, thus we use obj_read_lock() here.
           *
           * TODO: allowing text conversion to run in parallel with object
           * reading operations might increase performance in the multithreaded
           * non-worktreee git-grep with --textconv.
           */
          obj_read_lock();
          size = fill_textconv(r, driver, df, &buf);
          obj_read_unlock();
          free_filespec(df);

Note that this lock is used whether we're doing textconv's or not (i.e.,
it also excludes reading two objects from the object database at the
same time, because none of that code is thread-safe). But the latency
when we're doing a textconv is _much_ higher, because it's shelling out
to a separate process and reading/writing the contents. Note the
much-higher system CPU in your second timing:

> Note the difference in total CPU usage in the following expressions:
> 
> $ git grep --threads 12 -e foobar --and -e fizzbuzz &> /dev/null
> 0.24s user 0.28s system 710% cpu 0.073 total
> 
> $ git grep --threads 12 -e foobar --and -e fizzbuzz --textconv &> /dev/null
> 0.90s user 1.75s system 110% cpu 2.390 total

So I think implementing that TODO would help a lot (because each
textconv could in theory proceed in parallel).

As workaround in the meantime, I suspect that enabling
diff.<driver>.cachetextconv for your particular textconv config might
help. It would be slow on the first run, but then we'd be able to skip
the external process entirely for subsequent runs (the results are
cached in a git-notes tree, which are just raw object reads).

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep --threads 12 --textconv is effectively single-threaded
  2020-07-07 21:59 ` Jeff King
@ 2020-07-08 17:40   ` Zach Riggle
  2020-07-08 20:13     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Zach Riggle @ 2020-07-08 17:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

So I just need to

git -c "diff.c.cachetextconv=true" ...

And the cache should automagically work?

Zach Riggle

On Tue, Jul 7, 2020 at 4:59 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Jul 07, 2020 at 04:25:01PM -0500, Zach Riggle wrote:
>
> > It looks like the bit of code that is responsible for performing
> > textconv conversions is single-threaded, even if git-grep is provided
> > a number of threads to use.
>
> Yes, the locking is much coarser than it could be. The issue is in
> grep.c's fill_textconv_grep():
>
>           /*
>            * fill_textconv is not remotely thread-safe; it modifies the global
>            * diff tempfile structure, writes to the_repo's odb and might
>            * internally call thread-unsafe functions such as the
>            * prepare_packed_git() lazy-initializator. Because of the last two, we
>            * must ensure mutual exclusion between this call and the object reading
>            * API, thus we use obj_read_lock() here.
>            *
>            * TODO: allowing text conversion to run in parallel with object
>            * reading operations might increase performance in the multithreaded
>            * non-worktreee git-grep with --textconv.
>            */
>           obj_read_lock();
>           size = fill_textconv(r, driver, df, &buf);
>           obj_read_unlock();
>           free_filespec(df);
>
> Note that this lock is used whether we're doing textconv's or not (i.e.,
> it also excludes reading two objects from the object database at the
> same time, because none of that code is thread-safe). But the latency
> when we're doing a textconv is _much_ higher, because it's shelling out
> to a separate process and reading/writing the contents. Note the
> much-higher system CPU in your second timing:
>
> > Note the difference in total CPU usage in the following expressions:
> >
> > $ git grep --threads 12 -e foobar --and -e fizzbuzz &> /dev/null
> > 0.24s user 0.28s system 710% cpu 0.073 total
> >
> > $ git grep --threads 12 -e foobar --and -e fizzbuzz --textconv &> /dev/null
> > 0.90s user 1.75s system 110% cpu 2.390 total
>
> So I think implementing that TODO would help a lot (because each
> textconv could in theory proceed in parallel).
>
> As workaround in the meantime, I suspect that enabling
> diff.<driver>.cachetextconv for your particular textconv config might
> help. It would be slow on the first run, but then we'd be able to skip
> the external process entirely for subsequent runs (the results are
> cached in a git-notes tree, which are just raw object reads).
>
> -Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep --threads 12 --textconv is effectively single-threaded
  2020-07-08 17:40   ` Zach Riggle
@ 2020-07-08 20:13     ` Jeff King
  2020-07-08 21:06       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2020-07-08 20:13 UTC (permalink / raw)
  To: Zach Riggle; +Cc: git

On Wed, Jul 08, 2020 at 12:40:49PM -0500, Zach Riggle wrote:

> So I just need to
> 
> git -c "diff.c.cachetextconv=true" ...
> 
> And the cache should automagically work?

Yes, but there's a bit of subtlety with what you're grepping.

Try this example setup:

  git init -q repo
  cd repo
  echo content >file
  echo 'file diff=foo' >.gitattributes
  git add .
  git commit -m base
  git config diff.foo.textconv 'echo >&2 converting...; sleep 1; tr a-z A-Z <'

which should make it clear when the filter actually runs.

Now try this:

  echo >&2 "==> no textconv"
  git grep . file
  echo >&2 "==> with textconv"
  git grep --textconv . file
  echo >&2 "==> cached (one)"
  git -c diff.foo.cachetextconv=true grep --textconv . file
  echo >&2 "==> cached (two)"
  git -c diff.foo.cachetextconv=true grep --textconv . file

We'd expect the final one to say "file:CONTENT" but _without_ a
"converting" line. But it still runs the textconv script!

The problem is that the cache uses the blob id of the source as its key.
But since we're grepping files in the working directory, they don't have
an object id at all. If we grep in the tree, it works as expected:

  $ git -c diff.foo.cachetextconv=true grep --textconv . HEAD -- file
  converting...
  HEAD:file:CONTENT
  $ git -c diff.foo.cachetextconv=true grep --textconv . HEAD -- file
  HEAD:file:CONTENT

All of this textconv stuff was originally designed for the diff
infrastructure. Even when it's diffing files, if they're tracked in the
repository the diff code will pull the oid from the index (assuming the
file is stat-clean). But the grep code doesn't do that (and I doubt that
it matters for anything at all except this textconv caching feature).

It's probably possible to teach the grep code to do the same
check-in-the-index trick, but I'm not sure how complicated it would be.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep --threads 12 --textconv is effectively single-threaded
  2020-07-08 20:13     ` Jeff King
@ 2020-07-08 21:06       ` Junio C Hamano
  2020-07-09 23:10         ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2020-07-08 21:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Zach Riggle, git

Jeff King <peff@peff.net> writes:

> It's probably possible to teach the grep code to do the same
> check-in-the-index trick, but I'm not sure how complicated it would be.

I am not sure if we should even depend on the "check the object
database and use it instead of reading the working tree files" done
in diff code---somehow I thought we did the opposite for performance
(i.e. when we ought to be comparing two objects, taken from tree and
the index, if we notice that the index side is stat clean, we can
read/mmap the working tree file instead of going to the object layer
and deflating a loose object, or, worse yet, construct the blob by
repeatedly applying deltas on a base object in a packfile).

Is this one in the opposite direction done specifically for gaining
performance when textconv cache is in use?  If so, kudos to whoever
did it---that sounds like a clever thing to do.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep --threads 12 --textconv is effectively single-threaded
  2020-07-08 21:06       ` Junio C Hamano
@ 2020-07-09 23:10         ` Jeff King
  2020-07-10 16:43           ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2020-07-09 23:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Zach Riggle, git

On Wed, Jul 08, 2020 at 02:06:31PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > It's probably possible to teach the grep code to do the same
> > check-in-the-index trick, but I'm not sure how complicated it would be.
> 
> I am not sure if we should even depend on the "check the object
> database and use it instead of reading the working tree files" done
> in diff code---somehow I thought we did the opposite for performance
> (i.e. when we ought to be comparing two objects, taken from tree and
> the index, if we notice that the index side is stat clean, we can
> read/mmap the working tree file instead of going to the object layer
> and deflating a loose object, or, worse yet, construct the blob by
> repeatedly applying deltas on a base object in a packfile).
> 
> Is this one in the opposite direction done specifically for gaining
> performance when textconv cache is in use?  If so, kudos to whoever
> did it---that sounds like a clever thing to do.

No, it turns out that nobody was that clever (and I was simply
misremembering how it worked).

For a tree-to-tree or index-to-tree comparison, both sides will have an
oid and can use the textconv cache. Even for an index case where we
might choose to use a stat-fresh working tree file as an optimization,
we'll still consult the textconv cache before loading those contents.

But for diffing a file in the working tree, we'll never have an oid and
will always run the textconv command). So "git diff" against the index,
for example, would run _one_ textconv (using the cached value for the
index, and running one for the working tree version). And we know that
isn't that interesting for optimizing, since by definition the file is
stat-dirty in that case (or else we'd skip the content-level comparison
entirely). So you'd have to compute the sha1 of the working tree file
from scratch. Plus the lifetime of a working tree file's entry in the
textconv cache is probably smaller, since it hasn't even been committed
yet.

I don't think I ever noticed because the primary thing I was trying to
speed up with the textconv cache is "git log -p", etc, which always has
an oid to work with.

But "grep" is a totally different story. It is frequently looking at all
of the stat-fresh working tree files.

-Peff

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: git grep --threads 12 --textconv is effectively single-threaded
  2020-07-09 23:10         ` Jeff King
@ 2020-07-10 16:43           ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2020-07-10 16:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Zach Riggle, git

Jeff King <peff@peff.net> writes:

> For a tree-to-tree or index-to-tree comparison, both sides will have an
> oid and can use the textconv cache. Even for an index case where we
> might choose to use a stat-fresh working tree file as an optimization,
> we'll still consult the textconv cache before loading those contents.

OK, that "we'll still consult" part does sound like sort-of
"clever", but after thinking about the whole sentence twice,
I realize the actual cleverness lies in the reuse of stat-fresh
working tree file, not the use of textconv cache ;-)  The cache
is just doing its normal thing: if we know the oid, look up the
cached one.

> But for diffing a file in the working tree, we'll never have an oid and
> will always run the textconv command).

OK.  In such a case, we need to run the clean filter on the working
tree contents, and then finally we need to run the textconv on the
result.  We could internally hash the result of applying the clean
filter to see if we have the blob in the object database and use it
as the look-up key in the textconv cache, but we are talking about
the working tree files, which by definition is more fluid than what
is in the index, which is likely more fluid than what is already
committed, so the chance of finding a hit may be slim.

We could still see if the oid in the index is correct with the stat
check, but by definition, diff-files won't compare between a cache
entry and a working tree file if the path is stat-clean, so that
does not help all that much.  I wonder if diff-index comparing a
tree and the working tree (i.e. without "--cached") can be improved,
though.

> So "git diff" against the index,
> for example, would run _one_ textconv (using the cached value for the
> index, and running one for the working tree version). And we know that
> isn't that interesting for optimizing, since by definition the file is
> stat-dirty in that case (or else we'd skip the content-level comparison
> entirely).

Yup, we reached the same conclusion here ;-)

> So you'd have to compute the sha1 of the working tree file
> from scratch. Plus the lifetime of a working tree file's entry in the
> textconv cache is probably smaller, since it hasn't even been committed
> yet.

Yes again.

> I don't think I ever noticed because the primary thing I was trying to
> speed up with the textconv cache is "git log -p", etc, which always has
> an oid to work with.

Absolutely.

> But "grep" is a totally different story. It is frequently looking at all
> of the stat-fresh working tree files.

Yeah.  Grepping in the working tree files could be optimized with
the same technique that would be used to optimize diff-index without
"--cached".  When we look at the working tree file, we consult the
index and possibly learn its object name if the path is stat-clean,
probably if and only if textconv is in use.

"git grep" would divert to the "grep in a blob object" codepath from
there, and "git diff-index" would make clever use of oid_valid bit
in diff_filespec when running textconv (we might need a separate
bit, though; I haven't thought it through).

Or something like that.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-07-10 16:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-07 21:25 git grep --threads 12 --textconv is effectively single-threaded Zach Riggle
2020-07-07 21:59 ` Jeff King
2020-07-08 17:40   ` Zach Riggle
2020-07-08 20:13     ` Jeff King
2020-07-08 21:06       ` Junio C Hamano
2020-07-09 23:10         ` Jeff King
2020-07-10 16:43           ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).