git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: Zach Riggle <zachriggle@gmail.com>, git@vger.kernel.org
Subject: Re: git grep --threads 12 --textconv is effectively single-threaded
Date: Fri, 10 Jul 2020 09:43:15 -0700	[thread overview]
Message-ID: <xmqq5zavuxr0.fsf@gitster.c.googlers.com> (raw)
In-Reply-To: <20200709231030.GD664420@coredump.intra.peff.net> (Jeff King's message of "Thu, 9 Jul 2020 19:10:30 -0400")

Jeff King <peff@peff.net> writes:

> For a tree-to-tree or index-to-tree comparison, both sides will have an
> oid and can use the textconv cache. Even for an index case where we
> might choose to use a stat-fresh working tree file as an optimization,
> we'll still consult the textconv cache before loading those contents.

OK, that "we'll still consult" part does sound like sort-of
"clever", but after thinking about the whole sentence twice,
I realize the actual cleverness lies in the reuse of stat-fresh
working tree file, not the use of textconv cache ;-)  The cache
is just doing its normal thing: if we know the oid, look up the
cached one.

> But for diffing a file in the working tree, we'll never have an oid and
> will always run the textconv command).

OK.  In such a case, we need to run the clean filter on the working
tree contents, and then finally we need to run the textconv on the
result.  We could internally hash the result of applying the clean
filter to see if we have the blob in the object database and use it
as the look-up key in the textconv cache, but we are talking about
the working tree files, which by definition is more fluid than what
is in the index, which is likely more fluid than what is already
committed, so the chance of finding a hit may be slim.

We could still see if the oid in the index is correct with the stat
check, but by definition, diff-files won't compare between a cache
entry and a working tree file if the path is stat-clean, so that
does not help all that much.  I wonder if diff-index comparing a
tree and the working tree (i.e. without "--cached") can be improved,
though.

> So "git diff" against the index,
> for example, would run _one_ textconv (using the cached value for the
> index, and running one for the working tree version). And we know that
> isn't that interesting for optimizing, since by definition the file is
> stat-dirty in that case (or else we'd skip the content-level comparison
> entirely).

Yup, we reached the same conclusion here ;-)

> So you'd have to compute the sha1 of the working tree file
> from scratch. Plus the lifetime of a working tree file's entry in the
> textconv cache is probably smaller, since it hasn't even been committed
> yet.

Yes again.

> I don't think I ever noticed because the primary thing I was trying to
> speed up with the textconv cache is "git log -p", etc, which always has
> an oid to work with.

Absolutely.

> But "grep" is a totally different story. It is frequently looking at all
> of the stat-fresh working tree files.

Yeah.  Grepping in the working tree files could be optimized with
the same technique that would be used to optimize diff-index without
"--cached".  When we look at the working tree file, we consult the
index and possibly learn its object name if the path is stat-clean,
probably if and only if textconv is in use.

"git grep" would divert to the "grep in a blob object" codepath from
there, and "git diff-index" would make clever use of oid_valid bit
in diff_filespec when running textconv (we might need a separate
bit, though; I haven't thought it through).

Or something like that.


      reply	other threads:[~2020-07-10 16:43 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-07 21:25 git grep --threads 12 --textconv is effectively single-threaded Zach Riggle
2020-07-07 21:59 ` Jeff King
2020-07-08 17:40   ` Zach Riggle
2020-07-08 20:13     ` Jeff King
2020-07-08 21:06       ` Junio C Hamano
2020-07-09 23:10         ` Jeff King
2020-07-10 16:43           ` Junio C Hamano [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqq5zavuxr0.fsf@gitster.c.googlers.com \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=zachriggle@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).