From: Duy Nguyen <pclouds@gmail.com>
To: Matheus Tavares Bernardino <matheus.bernardino@usp.br>
Cc: "Jeff King" <peff@peff.net>,
"Christian Couder" <christian.couder@gmail.com>,
"Thomas Gummerer" <t.gummerer@gmail.com>,
git <git@vger.kernel.org>,
"Оля Тележная" <olyatelezhnaya@gmail.com>,
"Elijah Newren" <newren@gmail.com>,
"Tanushree Tumane" <tanushreetumane@gmail.com>
Subject: Re: Questions on GSoC 2019 Ideas
Date: Tue, 12 Mar 2019 17:02:37 +0700 [thread overview]
Message-ID: <20190312100237.GA20471@ash> (raw)
In-Reply-To: <CAHd-oW4e6CtcaKXbowqZM-pDAEGJxupHwBvFk2veaaYswt0hmQ@mail.gmail.com>
On Mon, Mar 11, 2019 at 09:18:47PM -0300, Matheus Tavares Bernardino wrote:
> I've been thinking on how I could implement a test to estimate the
> lock contention but had no success until now. I wanted to try
> mutrace[2] but couldn't install it; I tried valgrind's drd but it
> didn't seem to report a contention time estimation; And I tried
> measuring the time of "pthread_mutex_lock(&grep_mutex)" but I don't
> know how much significative this value is, since we can't directly
> compare it (the time of many threads at lock) with the overall
> execution time. Do you have an idea on how we could measure the lock
> contention here?
(I'm supposed to be doing something else, but this is more fun :D)
Yeah lock contention is probably hard to measure. But at least we can
measure how much time is blocked by mutex. Something like this [1]
gives me
$ time ~/w/git/temp/git grep --threads=8 abc HEAD >/dev/null
warning: let's have some fun
block_time = 20ms
block_count = 10725
real 0m0,379s
user 0m0,425s
sys 0m0,073s
From this I know "git grep" took 379ms and at least 20ms (probably
including measurement overhead) is wasted on pthread_mutex_lock(). It
does not look that significant, I admit.
> Another thing that is occurring to me right now is whether git-grep,
> as it is implemented today, would really benefit from thread-safe pack
> access. I may have to study the code more, but it seems to me that
> just the producer thread uses pack access.
That producer I think is just handing out assignments to worker
threads. The real work is still done by worker threads.
The entry point to pack access in this code is protected by
grep_read_lock(). I believe the multithread one is in this deep call
chain (I found it out by gdb)
[main thread] grep_oid() -> add_work() ->
[worker thread] grep_source() -> grep_source_1() ->
grep_source_is_binary() -> grep_source_load() ->
grep_source_load_oid() -> read_object_file()
Note that there's another source of pack access, the
lock_and_read_oid_file() in grep_tree(). This is where we unpack tree
objects and traverse to get blob SHA-1.
This code is currently on the main thread (maybe this is what you
found?) so it does not really benefit from multi thread. We could
still add some sort of lookahead queue to inflate tree objects in
advance in parallel, but I don't know how much gain that will be.
One thing I didn't notice is we currently force no threads in the case
we need pack access, e.g. "git grep <regex> <commit>" or "git grep
--cached <regex>". So if you need to experiment, you need to hack it
and remove that restriction. That's the "let's have some fun" code in
[1].
So, assuming this is CPU bottleneck, let's have a look at how CPU is used.
perf record git grep --threads=1 abc HEAD >/dev/null
perf report
shows me the top CPU consumers are
51,16% git libz.so.1.2.11 [.] inflate_fast
19,55% git git [.] bmexec
You need to do some guessing here. But I believe the top call is from
inflating objects (either from packs or from loose objects). The
second one is from regular expression engine.
Since the regex here is very short (I deliberately try Jeff's case
where object access dominates), the CPU is mostly used up for object
inflation. That suggests that if we could spread it out across cores,
the gain could be quite good. There aren't many dependencies in grep
tasks so if we spread the workload on 8 cores, execution time should
be reduced by 7 or 8 times.
For fun, I hacked up a horrible, horrible "thread safe" version [2]
that probably broke 99% of git and made helgrind extremely unhappy, so
these numbers are at best guidelines, but..
$ time ./git grep --threads=1 abc HEAD >/dev/null
real 0m0,253s
user 0m0,225s
sys 0m0,029s
$ time ./git grep --threads=8 abc HEAD >/dev/null
warning: let's have some fun!
real 0m0,157s
user 0m0,312s
sys 0m0,089s
You can see the "real" rows show quite good time reduction. Not 8
times reduction, mind you, but that's where serious people dig in and
really speed it up ;-)
[1] https://gitlab.com/snippets/1834609
[2] https://gitlab.com/snippets/1834613
> So, although it would be out of scope for GSoC, checkout, diff and log
> (and maybe others) could all benefit from a thread-safe/parallel pack
> access, right? If so, it is very motivating the impact this project
> could, in theory, have :)
We have to analyze case by case. It may turn out that there are many
opportunity to utilize multi threads. I think checkout is definitely a
good candidate. For "git diff" and "git log" maybe you can try "perf"
to see how much workload is locked in pack access (mostly inflation,
because it's easier to spot from the profile report)
--
Duy
next prev parent reply other threads:[~2019-03-12 10:02 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-28 21:46 Questions on GSoC 2019 Ideas Matheus Tavares Bernardino
2019-02-28 22:07 ` Christian Couder
2019-03-01 9:30 ` Duy Nguyen
2019-03-02 15:09 ` Thomas Gummerer
2019-03-03 7:18 ` Christian Couder
2019-03-03 10:12 ` Duy Nguyen
2019-03-03 10:17 ` Duy Nguyen
2019-03-05 4:51 ` Jeff King
2019-03-05 12:57 ` Duy Nguyen
2019-03-05 23:46 ` Matheus Tavares Bernardino
2019-03-06 10:17 ` Duy Nguyen
2019-03-12 0:18 ` Matheus Tavares Bernardino
2019-03-12 10:02 ` Duy Nguyen [this message]
2019-03-12 10:11 ` Duy Nguyen
2019-04-04 1:15 ` Matheus Tavares Bernardino
2019-04-04 7:56 ` Christian Couder
2019-04-04 8:20 ` Mike Hommey
2019-04-05 16:28 ` Matheus Tavares Bernardino
2019-04-07 23:40 ` Christian Couder
2019-03-05 23:03 ` Matheus Tavares Bernardino
2019-03-06 23:17 ` Thomas Gummerer
2019-03-03 10:03 ` Duy Nguyen
2019-03-03 16:12 ` Thomas Gummerer
2019-03-01 15:20 ` Johannes Schindelin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190312100237.GA20471@ash \
--to=pclouds@gmail.com \
--cc=christian.couder@gmail.com \
--cc=git@vger.kernel.org \
--cc=matheus.bernardino@usp.br \
--cc=newren@gmail.com \
--cc=olyatelezhnaya@gmail.com \
--cc=peff@peff.net \
--cc=t.gummerer@gmail.com \
--cc=tanushreetumane@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).