git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [BUG?] Major performance issue with some commands on our repo's master branch
@ 2022-06-04  7:39 Tassilo Horn
  2022-06-04 20:20 ` Tao Klerks
  0 siblings, 1 reply; 12+ messages in thread
From: Tassilo Horn @ 2022-06-04  7:39 UTC (permalink / raw)
  To: git

Hi all,

[spoiler alert: I've figured out the config option causing the problem
while writing this long mail, so you might jump straight to the SOLUTION
section at the bottom of this mail.]

at my day job, I work on a git repo (sadly non-public, proprietary) with
these stats:

- master has about 150000 commits, the last release branch I've also benchmarked above has 144000 commits
- the history dates back to 2001
- .git/ is about 1.8 GB

So it's quite big but not unusually big when compared to linux or other
free software projects.

The typical git commands I use (status, fetch, pull, commit, push,
rebase, ...) are all quick.  However, I use the git porcelain Magit [1]
which invokes several plumbing commands in order to present to the user
an always up-to-date extended status buffer of the currently checked out
branch showing the current branch.  Some of those plumbing commands are
extremely slow for no obvious reasons.  The most outstanding command I
could pinpoint is this:

--8<---------------cut here---------------start------------->8---
❯ time git show --no-patch --format="%h %s" "master^{commit}" --
6192a0cfdc6 Merge remote-tracking branch 'origin/SHD_ECORO_3_9_7'

________________________________________________________
Executed in   13.21 secs    fish           external
   usr time   12.99 secs  462.00 micros   12.99 secs
   sys time    0.17 secs  119.00 micros    0.17 secs
--8<---------------cut here---------------end--------------->8---

The interesting thing is that I have this problem only with the master
branch.  When I run it for the last release branch, I get these times:

--8<---------------cut here---------------start------------->8---
❯ time git show --no-patch --format="%h %s" "SHD_ECORO_3_9_7^{commit}" --
994334fc9fb ECOJ-33833 HTML-Formbrief: Bestellungs-Anhänge im KV-Kontext

________________________________________________________
Executed in   22.68 millis    fish           external
   usr time    7.71 millis  761.00 micros    6.95 millis
   sys time   10.47 millis  194.00 micros   10.28 millis
--8<---------------cut here---------------end--------------->8---

So you see, it's almost a factor of 1000 difference!  How can that be?

The split between master and the SHD_ECORO_3_X_X series of branches has
happened almost 2 years ago and master is way ahead of those.

--8<---------------cut here---------------start------------->8---
❯ git log --oneline master...origin/SHD_ECORO_3_9_7 | wc -l
5013
--8<---------------cut here---------------end--------------->8---

But there are around 9 merges from the last release branch into master
daily.

--8<---------------cut here---------------start------------->8---
❯ git log --merges --oneline --since 6months | wc -l
1611
--8<---------------cut here---------------end--------------->8---

From my memory, the issue hasn't popped up out of sudden but has gotten
worse slowly over time.  I have the impression that the worsening
increased pace over the last few month which might be the result of our
workflow.  Before, I've been the merge guy doing two "merge waves" from
the last supported release branch upwards into master once or twice a
day (usually release-branch -> next-release-branch -> master).  Since
about 3 month, we've switched to a workflow where every developer does
merge upwards herself just after committing/pushing to some lesser
branch than master simply because branches have diverged so much that
you'd need to be an expert in everything in order to be able to resolve
conflicts sensibly.

I should mention that I haven't seen this issue with any other repo I
have.  But that's also the biggest one I use.  The Emacs repository I
also work on is comparable in the number of commits but with much less
merges.

At last, here's the git bugreport sysinfo section on that machine and
repository.

--8<---------------cut here---------------start------------->8---
[System Info]
git version:
git version 2.36.1
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Linux 5.18.1-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Mon, 30 May 2022 17:53:16 +0000 x86_64
compiler info: gnuc: 11.2
libc info: glibc: 2.35
$SHELL (typically, interactive shell): /usr/bin/fish

[Enabled Hooks]
--8<---------------cut here---------------end--------------->8---

SOLUTION
========

While writing this long mail, I've figured out that the performance
penalty is caused by my setting of diff.renameLimit = 10000.  If I
comment that option in my ~/.gitconfig, the above command finishes in
150 millis instead of 13 seconds:

--8<---------------cut here---------------start------------->8---
❯ time git show --no-patch --format="%h %s" "master^{commit}" --
6192a0cfdc6 Merge remote-tracking branch 'origin/SHD_ECORO_3_9_7'

________________________________________________________
Executed in  147.99 millis    fish           external
   usr time  114.52 millis  713.00 micros  113.81 millis
   sys time   34.78 millis  193.00 micros   34.59 millis
--8<---------------cut here---------------end--------------->8---

But there's still the question why diff.renameLimit has an influence
here when --no-patch is provided so no diff should be generated.

Bye,
Tassilo

[1] https://magit.vc/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-04  7:39 [BUG?] Major performance issue with some commands on our repo's master branch Tassilo Horn
@ 2022-06-04 20:20 ` Tao Klerks
  2022-06-05 10:46   ` Tassilo Horn
  0 siblings, 1 reply; 12+ messages in thread
From: Tao Klerks @ 2022-06-04 20:20 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: git

(resending as text-only after having stupidly replied from my mobile)

I can add a couple things that may or may not be related here. I work
with a large proprietary repo, like you, and it is also not absurdly
large. I maintain some custom tooling for a large scale perforce
interop process.

I used to use "git show" (without patch) in this custom tooling to get
commit metadata, because it has the advantage that you can specify an
arbitrary list of commits in one call, saving some process overheads
in Windows especially.

I stopped using "git show" when user reports of slowness eventually
revealed two things:

1. Large commits (eg merges to feature branches from the fast-moving
main trunk) were taking a surprisingly long time, despite the
no-patch, which made me think it was doing the patch work anyway, and
just discarding it at the end.

2. Merge commits from long-outdated feature branches, even though the
final patch displayed by "git show" is small, also take a long time.
It seems as though whatever patch-related work "git show" does (and
given your observations I guess it might well be rename-detection), it
does it with respect to *both parents* in the case of a merge request,
even though the patch it shows is only changes wrt the first parent.

All this to say: I haven't understood your branch setup, but I'm
guessing that you're regularly integrating work from "far-behind"
branches, and most or all of your commits on master are therefore
merges with large diffs wrt the second parent, and those large diffs
wrt the second parent are what's "getting worse".

I haven't attempted to debug this, and personally have little
incentive to do, as switching to "git log" and accepting the process
overheads solved *my* problem.

If I get the chance to, I will obviously report back here.

Thanks,
Tao

On Sat, Jun 4, 2022 at 10:29 AM Tassilo Horn <tsdh@gnu.org> wrote:
>
> Hi all,
>
> [spoiler alert: I've figured out the config option causing the problem
> while writing this long mail, so you might jump straight to the SOLUTION
> section at the bottom of this mail.]
>
> at my day job, I work on a git repo (sadly non-public, proprietary) with
> these stats:
>
> - master has about 150000 commits, the last release branch I've also benchmarked above has 144000 commits
> - the history dates back to 2001
> - .git/ is about 1.8 GB
>
> So it's quite big but not unusually big when compared to linux or other
> free software projects.
>
> The typical git commands I use (status, fetch, pull, commit, push,
> rebase, ...) are all quick.  However, I use the git porcelain Magit [1]
> which invokes several plumbing commands in order to present to the user
> an always up-to-date extended status buffer of the currently checked out
> branch showing the current branch.  Some of those plumbing commands are
> extremely slow for no obvious reasons.  The most outstanding command I
> could pinpoint is this:
>
> --8<---------------cut here---------------start------------->8---
> ❯ time git show --no-patch --format="%h %s" "master^{commit}" --
> 6192a0cfdc6 Merge remote-tracking branch 'origin/SHD_ECORO_3_9_7'
>
> ________________________________________________________
> Executed in   13.21 secs    fish           external
>    usr time   12.99 secs  462.00 micros   12.99 secs
>    sys time    0.17 secs  119.00 micros    0.17 secs
> --8<---------------cut here---------------end--------------->8---
>
> The interesting thing is that I have this problem only with the master
> branch.  When I run it for the last release branch, I get these times:
>
> --8<---------------cut here---------------start------------->8---
> ❯ time git show --no-patch --format="%h %s" "SHD_ECORO_3_9_7^{commit}" --
> 994334fc9fb ECOJ-33833 HTML-Formbrief: Bestellungs-Anhänge im KV-Kontext
>
> ________________________________________________________
> Executed in   22.68 millis    fish           external
>    usr time    7.71 millis  761.00 micros    6.95 millis
>    sys time   10.47 millis  194.00 micros   10.28 millis
> --8<---------------cut here---------------end--------------->8---
>
> So you see, it's almost a factor of 1000 difference!  How can that be?
>
> The split between master and the SHD_ECORO_3_X_X series of branches has
> happened almost 2 years ago and master is way ahead of those.
>
> --8<---------------cut here---------------start------------->8---
> ❯ git log --oneline master...origin/SHD_ECORO_3_9_7 | wc -l
> 5013
> --8<---------------cut here---------------end--------------->8---
>
> But there are around 9 merges from the last release branch into master
> daily.
>
> --8<---------------cut here---------------start------------->8---
> ❯ git log --merges --oneline --since 6months | wc -l
> 1611
> --8<---------------cut here---------------end--------------->8---
>
> From my memory, the issue hasn't popped up out of sudden but has gotten
> worse slowly over time.  I have the impression that the worsening
> increased pace over the last few month which might be the result of our
> workflow.  Before, I've been the merge guy doing two "merge waves" from
> the last supported release branch upwards into master once or twice a
> day (usually release-branch -> next-release-branch -> master).  Since
> about 3 month, we've switched to a workflow where every developer does
> merge upwards herself just after committing/pushing to some lesser
> branch than master simply because branches have diverged so much that
> you'd need to be an expert in everything in order to be able to resolve
> conflicts sensibly.
>
> I should mention that I haven't seen this issue with any other repo I
> have.  But that's also the biggest one I use.  The Emacs repository I
> also work on is comparable in the number of commits but with much less
> merges.
>
> At last, here's the git bugreport sysinfo section on that machine and
> repository.
>
> --8<---------------cut here---------------start------------->8---
> [System Info]
> git version:
> git version 2.36.1
> cpu: x86_64
> no commit associated with this build
> sizeof-long: 8
> sizeof-size_t: 8
> shell-path: /bin/sh
> uname: Linux 5.18.1-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Mon, 30 May 2022 17:53:16 +0000 x86_64
> compiler info: gnuc: 11.2
> libc info: glibc: 2.35
> $SHELL (typically, interactive shell): /usr/bin/fish
>
> [Enabled Hooks]
> --8<---------------cut here---------------end--------------->8---
>
> SOLUTION
> ========
>
> While writing this long mail, I've figured out that the performance
> penalty is caused by my setting of diff.renameLimit = 10000.  If I
> comment that option in my ~/.gitconfig, the above command finishes in
> 150 millis instead of 13 seconds:
>
> --8<---------------cut here---------------start------------->8---
> ❯ time git show --no-patch --format="%h %s" "master^{commit}" --
> 6192a0cfdc6 Merge remote-tracking branch 'origin/SHD_ECORO_3_9_7'
>
> ________________________________________________________
> Executed in  147.99 millis    fish           external
>    usr time  114.52 millis  713.00 micros  113.81 millis
>    sys time   34.78 millis  193.00 micros   34.59 millis
> --8<---------------cut here---------------end--------------->8---
>
> But there's still the question why diff.renameLimit has an influence
> here when --no-patch is provided so no diff should be generated.
>
> Bye,
> Tassilo
>
> [1] https://magit.vc/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-04 20:20 ` Tao Klerks
@ 2022-06-05 10:46   ` Tassilo Horn
  2022-06-06  5:18     ` Tao Klerks
  2022-06-08 23:36     ` Jeff King
  0 siblings, 2 replies; 12+ messages in thread
From: Tassilo Horn @ 2022-06-05 10:46 UTC (permalink / raw)
  To: Tao Klerks; +Cc: git

Tao Klerks <tao@klerks.biz> writes:

Hi Tao,

thanks for your response.

> All this to say: I haven't understood your branch setup, but I'm
> guessing that you're regularly integrating work from "far-behind"
> branches, and most or all of your commits on master are therefore
> merges with large diffs wrt the second parent, and those large diffs
> wrt the second parent are what's "getting worse".

That's exactly correct.

> I haven't attempted to debug this, and personally have little
> incentive to do, as switching to "git log" and accepting the process
> overheads solved *my* problem.

And I'm happy to report it solves *my* problem as well.  There's a PR
for the Magit git porcelain replacing "git show" with an equivalent "git
log" incarnation which makes the 30seconds "refresh status buffer"
operation instant.

  https://github.com/magit/magit/issues/4702
  https://github.com/magit/magit/compare/km/show-to-log

Still maybe someone might want to have a look at the "git show" issue to
double-check if the performance burden in this specific case (no diff
should be generated) is warranted.  But at least I can work again with
no coffee-break long pauses, so I'm all satisfied. :-)

Thanks a lot for your insights.

Bye,
Tassilo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-05 10:46   ` Tassilo Horn
@ 2022-06-06  5:18     ` Tao Klerks
  2022-06-08 23:36     ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Tao Klerks @ 2022-06-06  5:18 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: git

On Sun, Jun 5, 2022 at 12:55 PM Tassilo Horn <tsdh@gnu.org> wrote:
>
> Tao Klerks <tao@klerks.biz> writes:
>
> > I haven't attempted to debug this, and personally have little
> > incentive to do, as switching to "git log" and accepting the process
> > overheads solved *my* problem.
>
> And I'm happy to report it solves *my* problem as well.  There's a PR
> for the Magit git porcelain replacing "git show" with an equivalent "git
> log" incarnation which makes the 30seconds "refresh status buffer"
> operation instant.
>
>   https://github.com/magit/magit/issues/4702
>   https://github.com/magit/magit/compare/km/show-to-log
>
> Still maybe someone might want to have a look at the "git show" issue to
> double-check if the performance burden in this specific case (no diff
> should be generated) is warranted.

I spent a little time with this yesterday, and can confirm:
* My issue seems to be the same as yours, "export GIT_TRACE2_PERF=1"
shows all the time being spent in rename detection
* "git show" is a slightly different entry point into the "git log"
code (log.c, cmd_show())
* options to the "git log" functionality are largely collected in a
"rev_info" object (defined in revision.h)
* one option is the "-c / --diff-merges=combined" option of "git log"
(setting rev_info.diff, rev_info.combine_merges and
rev_info.dense_combined_merges)
* another option is "-s / --no-patch" option of "git log" (setting
rev_info.diffopt.output_format |= DIFF_FORMAT_NO_OUTPUT)
* the "--patch" and "--no-patch" options seem to be (and are
documented as) opposites, but they are not implemented as such; one
calls for the work to be done, and the other only hides the output.
* the "diffs" options are set automatically/implicitly for "git show",
before argument parsing
* we can simulate the "git show" performance issue in "--git log -1"
by setting "--diff-merges=combined" *and* "--no-patch" explicitly
* this performance issue does *not* present in a "git log" call with
only the regular "--patch" argument, however; this basic "show
patches" instruction defaults to "--diff-merges=off", which means the
rename detection work does not need to happen. There might still be
slight diff-related overheads, but they are undetectable in my
testing.

Therefore, I have confirmed it is possible to get "git show" to behave
the way we would expect for "--no-patch", by *also* specifying
"--diff-merges=off".

There are at least two possible approaches / directions to improving
these issues generically in the "git show" implementation, I think:

1. Add some "git show"-specific code, saying something along the lines
of "if --no-patch is specified, then also imply "--diff-merges=off".
This feels like the safer option / less likely to have side-effects.

2. Add some post-processing to the generic "git log & git show"
options parsing, to generically propagate "--no-patch" into other
properties like those set by "--diff-merges=combined"

I don't feel confident enough with the code here to try for the second
approach, but the first looks like something I should be able to
propose a patch for - and in the meantime I know how to get the
"single-process, many arbitrary commits" performance benefit of "git
show" again. Thanks for sparking the exploration down this little
rabbit-hole!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-05 10:46   ` Tassilo Horn
  2022-06-06  5:18     ` Tao Klerks
@ 2022-06-08 23:36     ` Jeff King
  2022-06-09  1:27       ` Kyle Meyer
  2022-06-09  5:51       ` Tassilo Horn
  1 sibling, 2 replies; 12+ messages in thread
From: Jeff King @ 2022-06-08 23:36 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Tao Klerks, git

On Sun, Jun 05, 2022 at 12:46:15PM +0200, Tassilo Horn wrote:

> Still maybe someone might want to have a look at the "git show" issue to
> double-check if the performance burden in this specific case (no diff
> should be generated) is warranted.  But at least I can work again with
> no coffee-break long pauses, so I'm all satisfied. :-)

I suspect the issue may be quite subtle. Even you asked for
"--no-patch", the underlying diff may still be used for other things.
For example, simplifying away TREESAME commits. I.e., ones which did not
change anything from their parents after applying path restrictions,
diff-filters, etc. There may be other cases, too (e.g., --follow).

I think the code could be written to realize that none of those features
are in use, and disable the diff entirely in favor of checking whether
the two trees has the same object id. That would yield _mostly_ the same
behavior, though there are probably corner cases (e.g., a tree with an
odd mode entry, say, may get parsed so as to produce an empty diff, even
though it's not byte for byte identical). That may be an acceptable
tradeoff. But I think the code would be a bit brittle (it needs to know
about all the cases where a diff might matter, and we may add more
later).

In general, I think Git assumes that tree-level diffs aren't too painful
to produce. "git log" will do them, too, but just doesn't tickle your
particular case because it doesn't look at merges. So probably setting
diff.renamelimit correctly is not that bad a solution.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-08 23:36     ` Jeff King
@ 2022-06-09  1:27       ` Kyle Meyer
  2022-06-09 15:03         ` Jeff King
  2022-06-09  5:51       ` Tassilo Horn
  1 sibling, 1 reply; 12+ messages in thread
From: Kyle Meyer @ 2022-06-09  1:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Tassilo Horn, Tao Klerks, git

Jeff King writes:

> I suspect the issue may be quite subtle. Even you asked for
> "--no-patch", the underlying diff may still be used for other things.
> For example, simplifying away TREESAME commits. I.e., ones which did not
> change anything from their parents after applying path restrictions,
> diff-filters, etc. There may be other cases, too (e.g., --follow).
>
> I think the code could be written to realize that none of those features
> are in use, and disable the diff entirely in favor of checking whether
> the two trees has the same object id. That would yield _mostly_ the same
> behavior, though there are probably corner cases (e.g., a tree with an
> odd mode entry, say, may get parsed so as to produce an empty diff, even
> though it's not byte for byte identical). That may be an acceptable
> tradeoff. But I think the code would be a bit brittle (it needs to know
> about all the cases where a diff might matter, and we may add more
> later).

Do you think it'd be safe to make --no-patch imply --diff-merges=off, as
Tao suggested elsewhere in this thread?

  https://lore.kernel.org/git/CAPMMpog-7eDOrgSU9GjV4G9rk5RkL-PJhaUAO3_0p2YxfRf=LA@mail.gmail.com

If so, it seems like that'd be a good way to get speedups for some merge
commits.  For example, here are hyperfine timings for the current tip of
git.git's master branch:

  Benchmark #1: git show --no-patch --format=%h 1e59178e3f
    Time (mean ± σ):      47.8 ms ±   1.5 ms    [User: 43.2 ms, System: 4.6 ms]
    Range (min … max):    46.8 ms …  54.4 ms    59 runs
   
    Warning: Statistical outliers were detected. Consider re-running
    this benchmark on a quiet PC without any interferences from other
    programs. It might help to use the '--warmup' or '--prepare'
    options.
   
  Benchmark #2: git show --no-patch --diff-merges=off --format=%h 1e59178e3f
    Time (mean ± σ):       3.2 ms ±   0.2 ms    [User: 2.5 ms, System: 0.8 ms]
    Range (min … max):     2.9 ms …   6.8 ms    688 runs
   
    Warning: Command took less than 5 ms to complete. Results might be
    inaccurate.
    
    Warning: Statistical outliers were detected. Consider [...]
    options.
   
  Benchmark #3: git log --no-walk --format=%h 1e59178e3f
    Time (mean ± σ):       3.2 ms ±   0.1 ms    [User: 2.4 ms, System: 0.8 ms]
    Range (min … max):     2.9 ms …   4.2 ms    697 runs
   
    Warning: Command took less than 5 ms to complete. Results might [...]
    
    Warning: Statistical outliers were detected. Consider [...]
    
   
  Summary
    'git log --no-walk --format=%h 1e59178e3f' ran
      1.01 ± 0.08 times faster than 'git show --no-patch --diff-merges=off --format=%h 1e59178e3f'
     14.98 ± 0.79 times faster than 'git show --no-patch --format=%h 1e59178e3f'

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-08 23:36     ` Jeff King
  2022-06-09  1:27       ` Kyle Meyer
@ 2022-06-09  5:51       ` Tassilo Horn
  2022-06-09 15:05         ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Tassilo Horn @ 2022-06-09  5:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Tao Klerks, git

Jeff King <peff@peff.net> writes:

Hi Jeff,

>> Still maybe someone might want to have a look at the "git show" issue
>> to double-check if the performance burden in this specific case (no
>> diff should be generated) is warranted.  But at least I can work
>> again with no coffee-break long pauses, so I'm all satisfied. :-)
>
> I suspect the issue may be quite subtle. Even you asked for
> "--no-patch", the underlying diff may still be used for other things.
> For example, simplifying away TREESAME commits. I.e., ones which did
> not change anything from their parents after applying path
> restrictions, diff-filters, etc. There may be other cases, too (e.g.,
> --follow).

I see.  In the end, my issue was solved by my git porcelain switching to
a "git log" incarnation instead of "git show".  When I "git show"
manually, it's no big deal if it takes some time for merge commits.

> [...]
>
> So probably setting diff.renamelimit correctly is not that bad a
> solution.

Does your statement imply diff.renameLimit = 10000 is an incorrect
setting?  The thing is that I mostly work with java codebases where
every file rename implies a change in file contents, too.  A large
renameLimit seems to help in correctly detecting renames/copies although
I don't have empirical data but only gut feeling.

Bye,
Tassilo

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-09  1:27       ` Kyle Meyer
@ 2022-06-09 15:03         ` Jeff King
  2022-06-09 18:23           ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2022-06-09 15:03 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: Tassilo Horn, Tao Klerks, git

On Wed, Jun 08, 2022 at 09:27:43PM -0400, Kyle Meyer wrote:

> > I think the code could be written to realize that none of those features
> > are in use, and disable the diff entirely in favor of checking whether
> > the two trees has the same object id. That would yield _mostly_ the same
> > behavior, though there are probably corner cases (e.g., a tree with an
> > odd mode entry, say, may get parsed so as to produce an empty diff, even
> > though it's not byte for byte identical). That may be an acceptable
> > tradeoff. But I think the code would be a bit brittle (it needs to know
> > about all the cases where a diff might matter, and we may add more
> > later).
> 
> Do you think it'd be safe to make --no-patch imply --diff-merges=off, as
> Tao suggested elsewhere in this thread?
> 
>   https://lore.kernel.org/git/CAPMMpog-7eDOrgSU9GjV4G9rk5RkL-PJhaUAO3_0p2YxfRf=LA@mail.gmail.com

I'm not sure. If I do:

diff --git a/diff-merges.c b/diff-merges.c
index 7f64156b8b..f05a585dfb 100644
--- a/diff-merges.c
+++ b/diff-merges.c
@@ -164,6 +164,9 @@ void diff_merges_set_dense_combined_if_unset(struct rev_info *revs)
 
 void diff_merges_setup_revs(struct rev_info *revs)
 {
+	if (revs->diffopt.output_format == DIFF_FORMAT_NO_OUTPUT &&
+	    !revs->explicit_diff_merges)
+		diff_merges_suppress(revs);
 	if (revs->combine_merges == 0)
 		revs->dense_combined_merges = 0;
 	if (revs->separate_merges == 0)

then the test suite passes, but that may just be because we are not
invoking the right corner case. It does change the output with something
like:

  git show --diff-filter=D -s a6434bc6f7a1

Without the patch above, it always shows the commit. With it, it shows
nothing. That's a bit far-fetched, but it is a regression, and I'm also
not sure if it's just the tip of the iceberg.

It also doesn't solve problem completely. Regular commits can have
expensive diffs, too.

I think you'd do better to have a mode specific to git-show that skips
the diff if we're not showing it, but makes sure to always show the
commit anyway. Perhaps something like the hunk above, but put into
cmd_show(), and then setting revs->always_show_header. But it would
require somebody verifying that this does the right thing in all cases.

-Peff

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-09  5:51       ` Tassilo Horn
@ 2022-06-09 15:05         ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2022-06-09 15:05 UTC (permalink / raw)
  To: Tassilo Horn; +Cc: Tao Klerks, git

On Thu, Jun 09, 2022 at 07:51:36AM +0200, Tassilo Horn wrote:

> > So probably setting diff.renamelimit correctly is not that bad a
> > solution.
> 
> Does your statement imply diff.renameLimit = 10000 is an incorrect
> setting?  The thing is that I mostly work with java codebases where
> every file rename implies a change in file contents, too.  A large
> renameLimit seems to help in correctly detecting renames/copies although
> I don't have empirical data but only gut feeling.

Well, for some definition of incorrect. :) You are telling Git to spend
extra time computing renames, and then you were annoyed when it spent a
long time computing renames. So in that sense it was not what you
wanted.

It may be that you want different limits in different contexts, and the
current config is not sufficient to express that.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-09 15:03         ` Jeff King
@ 2022-06-09 18:23           ` Junio C Hamano
  2022-06-09 18:43             ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2022-06-09 18:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Kyle Meyer, Tassilo Horn, Tao Klerks, git

Jeff King <peff@peff.net> writes:

>   git show --diff-filter=D -s a6434bc6f7a1
>
> Without the patch above, it always shows the commit. With it, it shows
> nothing. That's a bit far-fetched, but it is a regression, and I'm also
> not sure if it's just the tip of the iceberg.

Here "-s" is merely "do not give patch output like we do by
default", so the behaviour is quite understandable and is not a
regression we would want to see happen.  -S/-G are also likely to be
affected, not just the --diff-filter.

> It also doesn't solve problem completely. Regular commits can have
> expensive diffs, too.

That's a good point.

> I think you'd do better to have a mode specific to git-show that skips
> the diff if we're not showing it, but makes sure to always show the
> commit anyway.

Meaning an explicit option "git show --log-only"?  We'd need to
careful to make it either (1) be incompatible with certain features
of "git show" (like giving a pathspec) and error out, or (2) ignore
these features of "git show" silently and document that.  But it
would work as a new option.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-09 18:23           ` Junio C Hamano
@ 2022-06-09 18:43             ` Jeff King
  2022-06-09 20:06               ` Junio C Hamano
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2022-06-09 18:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Kyle Meyer, Tassilo Horn, Tao Klerks, git

On Thu, Jun 09, 2022 at 11:23:58AM -0700, Junio C Hamano wrote:

> > I think you'd do better to have a mode specific to git-show that skips
> > the diff if we're not showing it, but makes sure to always show the
> > commit anyway.
> 
> Meaning an explicit option "git show --log-only"?  We'd need to
> careful to make it either (1) be incompatible with certain features
> of "git show" (like giving a pathspec) and error out, or (2) ignore
> these features of "git show" silently and document that.  But it
> would work as a new option.

Certainly a new option would mean we weren't regressing any existing
behavior. I do think it would be a hard option to explain, though.

But what I wondered is whether "show" in particular, because it would
never want to skip showing a commit, could get away with avoiding the
diff automatically. I.e., currently "git show -Sfoo HEAD" will always
show HEAD, even if "-S" does not match anything. So if we are not
showing any diff output, there is no need to compute the diff in that
case. That is unlike "git log", which would omit commits that didn't
match.

And really it is not "git show" that is special there, but the
always_show_header flag it sets. So something like this might work:

diff --git a/log-tree.c b/log-tree.c
index d0ac0a6327..ed57386938 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -1024,6 +1024,10 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
 	if (!all_need_diff && !opt->merges_need_diff)
 		return 0;
 
+	if (opt->diffopt.output_format == DIFF_FORMAT_NO_OUTPUT &&
+	    opt->always_show_header)
+		return 0;
+
 	parse_commit_or_die(commit);
 	oid = get_commit_tree_oid(commit);
 

It produces the same output in the cases I tried. And running with
GIT_TRACE2_PERF shows that it doesn't diff and rename code.

I'm not overly confident that it isn't violating some other subtle
assumption / corner case that I haven't thought of, though. :)

-Peff

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG?] Major performance issue with some commands on our repo's master branch
  2022-06-09 18:43             ` Jeff King
@ 2022-06-09 20:06               ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2022-06-09 20:06 UTC (permalink / raw)
  To: Jeff King; +Cc: Kyle Meyer, Tassilo Horn, Tao Klerks, git

Jeff King <peff@peff.net> writes:

> But what I wondered is whether "show" in particular, because it would
> never want to skip showing a commit, could get away with avoiding the
> diff automatically.

Ahh, that is a clever thought.  At least unless we automatically
turn ourselves into "git log" by giving an range, we are naming
individual object we want to see, so why not show them?

But I wonder if "git show A -- P" should or should not show the
commit if A does not touch the path P.  Right now we apply the same
history simplification so "git show master -- t/" gives nothing to
me, which is probably one sensible thing to do.  It is debatable why
somebody who wants to see 'master' wants to hide it when it does not
touch the paths that match the pathspec given, but it can also be
debated why somebody would give a pathspec if commits are to be
hidden when they do not touch paths that match it, so...

> I.e., currently "git show -Sfoo HEAD" will always
> show HEAD, even if "-S" does not match anything. So if we are not
> showing any diff output, there is no need to compute the diff in that
> case. That is unlike "git log", which would omit commits that didn't
> match.

OK, you came up with an example that behaves differently.

> And really it is not "git show" that is special there, but the
> always_show_header flag it sets. So something like this might work:

A tempting thought, indeed.

> diff --git a/log-tree.c b/log-tree.c
> index d0ac0a6327..ed57386938 100644
> --- a/log-tree.c
> +++ b/log-tree.c
> @@ -1024,6 +1024,10 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
>  	if (!all_need_diff && !opt->merges_need_diff)
>  		return 0;
>  
> +	if (opt->diffopt.output_format == DIFF_FORMAT_NO_OUTPUT &&
> +	    opt->always_show_header)
> +		return 0;
> +
>  	parse_commit_or_die(commit);
>  	oid = get_commit_tree_oid(commit);
>  
>
> It produces the same output in the cases I tried. And running with
> GIT_TRACE2_PERF shows that it doesn't diff and rename code.
>
> I'm not overly confident that it isn't violating some other subtle
> assumption / corner case that I haven't thought of, though. :)
>
> -Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-06-09 20:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-04  7:39 [BUG?] Major performance issue with some commands on our repo's master branch Tassilo Horn
2022-06-04 20:20 ` Tao Klerks
2022-06-05 10:46   ` Tassilo Horn
2022-06-06  5:18     ` Tao Klerks
2022-06-08 23:36     ` Jeff King
2022-06-09  1:27       ` Kyle Meyer
2022-06-09 15:03         ` Jeff King
2022-06-09 18:23           ` Junio C Hamano
2022-06-09 18:43             ` Jeff King
2022-06-09 20:06               ` Junio C Hamano
2022-06-09  5:51       ` Tassilo Horn
2022-06-09 15:05         ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).