git-status performance with submodules

git@vger.kernel.org mailing list mirror (one of many)

* git-status performance with submodules
@ 2019-12-02  6:19 D. Ben Knoble
  2019-12-02  6:50 ` Junio C Hamano
From: D. Ben Knoble @ 2019-12-02  6:19 UTC
  To: git

[If this has already gone through multiple times, I apologize for the
repetition; I have had a hard time getting GMail to send this. Past
versions had attachments, which I believe contributed to failures.
This one has none, but has links to all the content.]

Hello all,

I have a concern about the performance of git-status with many (~38)
submodules. As part of a (large-scale) system dynamics class, I was tasked
with identifying a performance problem, tracing it using KUTrace(2)[3], and
subsequently investigating it. I ended up with some unique observations about
git-status and submodules[2].

The interactive HTML traces are available on Google Drive[4][5].

I won't recreate all the details here, but I would encourage you to play with
the traces, or at least go through the slides[1].

### The short version

git-status is slow in a repository with many submodules(1)(3).

### Baseline

- time git-status, with many submodules, and --ignore-submodules=none
    0.497s
- time git-status in repos without many submodules
    0.014s

### What I consider a temporary fix

- time git-status, with many submodules, and --ignore-submodules=all
    0.026s
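
A minimal sketch of how timings like the ones above might be reproduced. The helper names `time_command` and `compare_status_modes` are hypothetical, and taking the best of several runs is an assumption about methodology, not a description of how the numbers above were collected.

```python
import subprocess
import time


def time_command(cmd, cwd=".", runs=5):
    """Return the best wall-clock time in seconds over `runs` executions of cmd."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, cwd=cwd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        best = min(best, time.perf_counter() - start)
    return best


def compare_status_modes(repo=".", runs=5):
    """Time `git status` with submodules inspected (none) and ignored (all)."""
    return {
        mode: time_command(
            ["git", "status", f"--ignore-submodules={mode}"],
            cwd=repo, runs=runs)
        for mode in ("none", "all")
    }
```

Calling `compare_status_modes("/path/to/superproject")` returns a dict mapping each mode to its best time in seconds; the gap between the two values approximates the cost of inspecting submodules.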

### What I would like to see

I would like to improve git-status performance with this many submodules,
so that I can remove diff.ignoreSubmodules=all from my config (submodule state
is useful information, and the setting affects many commands). I would be
willing to work on a discussed and designed fix.

### What I am curious about

From the traces (linked above), it appears that git-status suffers from a lack of
(possibly embarrassing) parallelism: I would expect each submodule to be
independently check-able, but the process section of the trace has them
executing serially (for reasons unknown to me). The apparent need to fork/exec
many processes in this way also appears to be a source of latency, along with
the very large number of filesystem-related syscalls (if my understanding is
correct).
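
To illustrate the kind of parallelism meant here, the following is a rough sketch that checks each submodule concurrently from outside git. It is an external approximation, not a description of what git-status does internally; the function names and the `max_workers` default are hypothetical. Threads are enough because the real work happens in the child `git` processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def submodule_is_dirty(path):
    """Return True if `git status --porcelain` prints anything in `path`."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=path, capture_output=True, text=True, check=False)
    return bool(result.stdout.strip())


def check_submodules_parallel(paths, check=submodule_is_dirty, max_workers=8):
    """Run `check` on every submodule path concurrently; return {path: result}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `paths`.
        return dict(zip(paths, pool.map(check, paths)))
```

Each check is independent of the others, so the total wall-clock time should approach that of the slowest submodule rather than the sum over all of them, assuming enough cores and I/O bandwidth.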

What can we do to fix this? Is there a reason for this (really terribly slow)
serial execution? Is this something developers haven't bothered to optimize
("unexpected use case")? If so, I would like to discuss taking a crack at it,
because I do have at least one repository with this many submodules, and I
care about its performance.

---

Notes

1) All timings were taken with the https://github.com/benknoble/Dotfiles repo
from around commit da194a8f4104a9fc74e8895ebc8512434f07d393

2) KUTrace is a set of kernel patches and userspace programs that provide
low-overhead tracing, as well as tools for post-processing those traces

3) Timings taken on my machine (2012 MacBook Pro; can provide more details if
requested)

---

Links

[1]: https://docs.google.com/presentation/d/1z-6ffE9KY-Jswl2BiWzYV2DG6fOutgWSi_aZ5uql__s/edit?usp=sharing
[2]: https://benknoble.github.io/blog/2019/11/07/git-stat/
[3]: https://github.com/dicksites/KUtrace
[4]: https://drive.google.com/file/d/1JyYO420yWp7XvNJJ8HLOPU0o6mesSKZf/view?usp=sharing
[5]: https://drive.google.com/file/d/1BqqxH0PRCYz_vvYkBBFpbL5dkFTLPyuK/view?usp=sharing


* Re: git-status performance with submodules
  2019-12-02  6:19 git-status performance with submodules D. Ben Knoble
@ 2019-12-02  6:50 ` Junio C Hamano
  2019-12-02 14:05   ` Jeff King
From: Junio C Hamano @ 2019-12-02  6:50 UTC
  To: D. Ben Knoble; +Cc: git

"D. Ben Knoble" <ben.knoble@gmail.com> writes:

> ### What I am curious about
>
> From the traces (linked above), it appears that git-status suffers from a lack of
> (possibly embarrassing) parallelism: I would expect each submodule to be
> independently check-able, ...
> ...
> What can we do to fix this? Is there a reason for this (really terribly slow)
> serial execution? Is this something developers haven't bothered to optimize
> ("unexpected use case")? If so, I would like to discuss taking a crack at it,
> because I do have at least one repository with this many submodules, and I
> care about its performance.

Nice to hear from somebody who cares about improving submodule
support.  Offhand, I cannot think of a reason why we inherently have
to process them serially.

But the way "git status" code is structured, it probably takes a bit
of preparatory refactoring.  If I recall correctly, it walks each
path in the index in the superproject and notes how the file in the
working tree is different from that of the index and the HEAD, under
the assumption that inspection of each path is relatively cheap and
at the same cost.  You'd first need to restructure that part so that
inspecting groups of index entries can be sharded to separate
subprocesses while the parent process waits, and have them report to
the parent process, and let the parent process continue with the
aggregated result, or something like that.
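
The restructuring described above could look roughly like the sketch below. Everything in it is hypothetical: `shard`, `inspect_shard`, and `status_sharded` are illustrative names, `inspect` stands in for whatever per-path comparison is done, and a thread pool stands in for the separate subprocesses.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(entries, n_shards):
    """Split entries into n_shards round-robin groups (some may be empty)."""
    return [entries[i::n_shards] for i in range(n_shards)]


def inspect_shard(entries, inspect):
    """Worker: inspect every entry in one shard, returning (path, state) pairs."""
    return [(path, inspect(path)) for path in entries]


def status_sharded(entries, inspect, n_shards=4):
    """Parent: hand shards to workers, wait for all of them, then aggregate."""
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        futures = [pool.submit(inspect_shard, s, inspect)
                   for s in shard(entries, n_shards)]
        # result() blocks, so the parent waits here for every worker.
        merged = dict(pair for f in futures for pair in f.result())
    # Report in the original index order, as a serial walk would.
    return [(path, merged[path]) for path in entries]
```

Round-robin sharding is only one choice; since the cost per path is not uniform once submodules are involved, grouping the expensive entries separately might balance the work better.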

Thanks.


* Re: git-status performance with submodules
  2019-12-02  6:50 ` Junio C Hamano
@ 2019-12-02 14:05   ` Jeff King
From: Jeff King @ 2019-12-02 14:05 UTC
  To: D. Ben Knoble; +Cc: Junio C Hamano, git

On Sun, Dec 01, 2019 at 10:50:29PM -0800, Junio C Hamano wrote:

> But the way "git status" code is structured, it probably takes a bit
> of preparatory refactoring.  If I recall correctly, it walks each
> path in the index in the superproject and notes how the file in the
> working tree is different from that of the index and the HEAD, under
> the assumption that inspection of each path is relatively cheap and
> at the same cost.  You'd first need to restructure that part so that
> inspecting groups of index entries can be sharded to separate
> subprocesses while the parent process waits, and have them report to
> the parent process, and let the parent process continue with the
> aggregated result, or something like that.

There's some prior art for this approach in git-checkout, where we have
a similar problem with latency of filters (e.g., for LFS). There the
individual status for a path becomes a tri-state: success, error, or
deferred. And then we collect the results from the deferred ones in a
loop.
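
A minimal sketch of that tri-state pattern, for readers who have not seen it. The names `Status`, `process_paths`, `start`, and `finish` are hypothetical and do not correspond to identifiers in git's checkout code.

```python
import enum


class Status(enum.Enum):
    SUCCESS = "success"
    ERROR = "error"
    DEFERRED = "deferred"


def process_paths(paths, start, finish):
    """First pass starts work on each path; second pass collects deferred ones.

    start(path)  -> (Status, value); DEFERRED means the result is not ready yet.
    finish(path) -> (Status, value); blocks until the deferred result is ready
                    and is expected to return SUCCESS or ERROR.
    """
    results, deferred = {}, []
    for path in paths:
        status, value = start(path)
        if status is Status.DEFERRED:
            deferred.append(path)
        else:
            results[path] = (status, value)
    # Collect the results for the deferred paths in a loop.
    for path in deferred:
        results[path] = finish(path)
    return results
```

Cheap paths are resolved immediately in the first pass, while slow ones (filters there, submodules here) can make progress in the background until the second pass collects them.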

I think (but didn't look carefully) that this could be slotted into the
diff code pretty easily. After the tree-level diff we have a queue of
candidates in memory. At that point we should be able to kick off a
process in parallel for each submodule, then wait for them all to finish
before proceeding. Maybe even as a stage of diffcore_std(), but I'm not
sure.

(Hand-wavey, I know, but just trying to point interested folks in the
right direction).

-Peff

