git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Duy Nguyen <pclouds@gmail.com>
To: Matheus Tavares Bernardino <matheus.bernardino@usp.br>
Cc: "Jeff King" <peff@peff.net>,
	"Christian Couder" <christian.couder@gmail.com>,
	"Thomas Gummerer" <t.gummerer@gmail.com>,
	git <git@vger.kernel.org>,
	"Оля Тележная" <olyatelezhnaya@gmail.com>,
	"Elijah Newren" <newren@gmail.com>,
	"Tanushree Tumane" <tanushreetumane@gmail.com>
Subject: Re: Questions on GSoC 2019 Ideas
Date: Tue, 12 Mar 2019 17:02:37 +0700	[thread overview]
Message-ID: <20190312100237.GA20471@ash> (raw)
In-Reply-To: <CAHd-oW4e6CtcaKXbowqZM-pDAEGJxupHwBvFk2veaaYswt0hmQ@mail.gmail.com>

On Mon, Mar 11, 2019 at 09:18:47PM -0300, Matheus Tavares Bernardino wrote:
> I've been thinking on how I could implement a test to estimate the
> lock contention but had no success until now. I wanted to try
> mutrace[2] but couldn't install it; I tried valgrind's drd but it
> didn't seem to report a contention time estimation; And I tried
> measuring the time of "pthread_mutex_lock(&grep_mutex)" but I don't
> know how much significative this value is, since we can't directly
> compare it (the time of many threads at lock) with the overall
> execution time. Do you have an idea on how we could measure the lock
> contention here?

(I'm supposed to be doing something else, but this is more fun :D)

Yeah lock contention is probably hard to measure. But at least we can
measure how much time is blocked by mutex. Something like this [1]
gives me

    $ time ~/w/git/temp/git grep --threads=8 abc HEAD >/dev/null
    warning: let's have some fun
    block_time = 20ms
    block_count = 10725
    
    real    0m0,379s
    user    0m0,425s
    sys     0m0,073s

From this I know "git grep" took 379ms and at least 20ms (probably
including measurement overhead) is wasted on pthread_mutex_lock(). It
does not look that significant, I admit.

> Another thing that is occurring to me right now is whether git-grep,
> as it is implemented today, would really benefit from thread-safe pack
> access. I may have to study the code more, but it seems to me that
> just the producer thread uses pack access.

That producer I think is just handing out assignments to worker
threads. The real work is still done by worker threads.

The entry point to pack access in this code is protected by
grep_read_lock(). I believe the multithread one is in this deep call
chain (I found it out by gdb)

[main thread] grep_oid() -> add_work() ->
[worker thread] grep_source() -> grep_source_1() ->
grep_source_is_binary() -> grep_source_load() ->
grep_source_load_oid() -> read_object_file()

Note that there's another source of pack access, the
lock_and_read_oid_file() in grep_tree(). This is where we unpack tree
objects and traverse to get blob SHA-1.

This code is currently on the main thread (maybe this is what you
found?) so it does not really benefit from multi thread. We could
still add some sort of lookahead queue to inflate tree objects in
advance in parallel, but I don't know how much gain that will be.

One thing I didn't notice is we currently force no threads in the case
we need pack access, e.g. "git grep <regex> <commit>" or "git grep
--cached <regex>". So if you need to experiment, you need to hack it
and remove that restriction. That's the "let's have some fun" code in
[1].

So, assuming this is CPU bottleneck, let's have a look at how CPU is used.

    perf record git grep --threads=1 abc HEAD >/dev/null
    perf report

shows me the top CPU consumers are

  51,16%  git      libz.so.1.2.11      [.] inflate_fast
  19,55%  git      git                 [.] bmexec

You need to do some guessing here. But I believe the top call is from
inflating objects (either from packs or from loose objects). The
second one is from regular expression engine.

Since the regex here is very short (I deliberately try Jeff's case
where object access dominates), the CPU is mostly used up for object
inflation. That suggests that if we could spread it out across cores,
the gain could be quite good. There aren't many dependencies in grep
tasks so if we spread the workload on 8 cores, execution time should
be reduced by 7 or 8 times.

For fun, I hacked up a horrible, horrible "thread safe" version [2]
that probably broke 99% of git and made helgrind extremely unhappy, so
these numbers are at best guidelines, but..

    $ time ./git grep --threads=1 abc HEAD >/dev/null
    
    real    0m0,253s
    user    0m0,225s
    sys     0m0,029s
    
    $ time ./git grep --threads=8 abc HEAD >/dev/null
    warning: let's have some fun!
    
    real    0m0,157s
    user    0m0,312s
    sys     0m0,089s

You can see the "real" rows show quite good time reduction. Not 8
times reduction, mind you, but that's where serious people dig in and
really speed it up ;-)

[1] https://gitlab.com/snippets/1834609
[2] https://gitlab.com/snippets/1834613

> So, although it would be out of scope for GSoC, checkout, diff and log
> (and maybe others) could all benefit from a thread-safe/parallel pack
> access, right? If so, it is very motivating the impact this project
> could, in theory, have :)

We have to analyze case by case. It may turn out that there are many
opportunity to utilize multi threads. I think checkout is definitely a
good candidate. For "git diff" and "git log" maybe you can try "perf"
to see how much workload is locked in pack access (mostly inflation,
because it's easier to spot from the profile report)
--
Duy

  reply	other threads:[~2019-03-12 10:02 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-28 21:46 Questions on GSoC 2019 Ideas Matheus Tavares Bernardino
2019-02-28 22:07 ` Christian Couder
2019-03-01  9:30   ` Duy Nguyen
2019-03-02 15:09     ` Thomas Gummerer
2019-03-03  7:18       ` Christian Couder
2019-03-03 10:12         ` Duy Nguyen
2019-03-03 10:17           ` Duy Nguyen
2019-03-05  4:51           ` Jeff King
2019-03-05 12:57             ` Duy Nguyen
2019-03-05 23:46               ` Matheus Tavares Bernardino
2019-03-06 10:17                 ` Duy Nguyen
2019-03-12  0:18                   ` Matheus Tavares Bernardino
2019-03-12 10:02                     ` Duy Nguyen [this message]
2019-03-12 10:11                       ` Duy Nguyen
2019-04-04  1:15                         ` Matheus Tavares Bernardino
2019-04-04  7:56                           ` Christian Couder
2019-04-04  8:20                             ` Mike Hommey
2019-04-05 16:28                             ` Matheus Tavares Bernardino
2019-04-07 23:40                               ` Christian Couder
2019-03-05 23:03         ` Matheus Tavares Bernardino
2019-03-06 23:17           ` Thomas Gummerer
2019-03-03 10:03       ` Duy Nguyen
2019-03-03 16:12         ` Thomas Gummerer
2019-03-01 15:20   ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190312100237.GA20471@ash \
    --to=pclouds@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=matheus.bernardino@usp.br \
    --cc=newren@gmail.com \
    --cc=olyatelezhnaya@gmail.com \
    --cc=peff@peff.net \
    --cc=t.gummerer@gmail.com \
    --cc=tanushreetumane@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).