git@vger.kernel.org mailing list mirror (one of many)
* Questions on GSoC 2019 Ideas
@ 2019-02-28 21:46 Matheus Tavares Bernardino
  2019-02-28 22:07 ` Christian Couder
  0 siblings, 1 reply; 24+ messages in thread
From: Matheus Tavares Bernardino @ 2019-02-28 21:46 UTC (permalink / raw)
  To: git
  Cc: Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Christian Couder

Hi, everyone

I've been on the mailing list for a couple of weeks now, mainly
working on my GSoC micro-project[1] and on other patches derived
from it. I have also been contributing to the Linux kernel for half
a year, but am now mainly just supporting other students here at USP.

I have read the GSoC 2019 ideas page, and many of the ideas interest
me. Also, while looking through Git development materials on the web,
I came across the GSoC 2012 ideas page, where this one caught my
attention:
https://github.com/peff/git/wiki/SoC-2012-Ideas#improving-parallelism-in-various-commands

I'm interested in parallel computing, which has been my research
topic for about a year now, so I would like to ask about the status
of this GSoC idea. I've read git-grep's code and saw that it is
already parallel, but I was wondering whether parallelism has
already been considered for any other part of Git, in pursuit of
greater performance, and whether that could perhaps be a GSoC
project.

I'm also interested in the 'convert script to builtin' task, and I was
wondering if 'git-bisect' would be a good candidate for that.

Thanks,
Matheus Tavares

[1]: https://public-inbox.org/git/20190226051804.10631-1-matheus.bernardino@usp.br/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-02-28 21:46 Questions on GSoC 2019 Ideas Matheus Tavares Bernardino
@ 2019-02-28 22:07 ` Christian Couder
  2019-03-01  9:30   ` Duy Nguyen
  2019-03-01 15:20   ` Johannes Schindelin
  0 siblings, 2 replies; 24+ messages in thread
From: Christian Couder @ 2019-02-28 22:07 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

Hi Matheus,

On Thu, Feb 28, 2019 at 10:46 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> I've been on the mailing list for a couple of weeks now, mainly
> working on my GSoC micro-project[1] and on other patches derived
> from it. I have also been contributing to the Linux kernel for half
> a year, but am now mainly just supporting other students here at USP.
>
> I have read the GSoC 2019 ideas page, and many of the ideas interest
> me. Also, while looking through Git development materials on the web,
> I came across the GSoC 2012 ideas page, where this one caught my
> attention:
> https://github.com/peff/git/wiki/SoC-2012-Ideas#improving-parallelism-in-various-commands
>
> I'm interested in parallel computing, which has been my research
> topic for about a year now, so I would like to ask about the status
> of this GSoC idea. I've read git-grep's code and saw that it is
> already parallel, but I was wondering whether parallelism has
> already been considered for any other part of Git, in pursuit of
> greater performance, and whether that could perhaps be a GSoC
> project.

I vaguely remember that we thought at one point that all the
low-hanging fruit in this area had already been picked, but I might
be wrong.

> I'm also interested in the 'convert script to builtin' task, and I was
> wondering if 'git-bisect' would be a good candidate for that.

There is ongoing work on that by Tanushree Tumane, an Outreachy
intern. The internship ends March 5, but it seems to me that there is
not a lot of work left on the project, and hopefully Tanushree or her
mentors will take care of it.

Best,
Christian.

> Thanks,
> Matheus Tavares
>
> [1]: https://public-inbox.org/git/20190226051804.10631-1-matheus.bernardino@usp.br/


* Re: Questions on GSoC 2019 Ideas
  2019-02-28 22:07 ` Christian Couder
@ 2019-03-01  9:30   ` Duy Nguyen
  2019-03-02 15:09     ` Thomas Gummerer
  2019-03-01 15:20   ` Johannes Schindelin
  1 sibling, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2019-03-01  9:30 UTC (permalink / raw)
  To: Christian Couder
  Cc: Matheus Tavares Bernardino, git, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Fri, Mar 1, 2019 at 5:20 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> Hi Matheus,
>
> On Thu, Feb 28, 2019 at 10:46 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I've been on the mailing list for a couple of weeks now, mainly
> > working on my GSoC micro-project[1] and on other patches derived
> > from it. I have also been contributing to the Linux kernel for half
> > a year, but am now mainly just supporting other students here at USP.
> >
> > I have read the GSoC 2019 ideas page, and many of the ideas interest
> > me. Also, while looking through Git development materials on the web,
> > I came across the GSoC 2012 ideas page, where this one caught my
> > attention:
> > https://github.com/peff/git/wiki/SoC-2012-Ideas#improving-parallelism-in-various-commands
> >
> > I'm interested in parallel computing, which has been my research
> > topic for about a year now, so I would like to ask about the status
> > of this GSoC idea. I've read git-grep's code and saw that it is
> > already parallel, but I was wondering whether parallelism has
> > already been considered for any other part of Git, in pursuit of
> > greater performance, and whether that could perhaps be a GSoC
> > project.
>
> I vaguely remember that we thought at one point that all the
> low-hanging fruit in this area had already been picked, but I might
> be wrong.

We still have to remove some global variables, which is quite easy to
do, before one could actually add mutexes and such to allow
concurrent pack access. I don't know, though, whether removing the
global variables is exciting enough for GSoC, or whether both tasks
could fit in one GSoC. Adding the parallel access is not that hard, I
think, once you know packfile.c and sha1-file.c relatively well. It's
mostly a matter of dealing safely with the caches and all the sliding
access windows.
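The refactoring described above follows a common pattern: move each
file-scope global into a context struct that callers pass around
explicitly, then guard the shared pieces with a mutex. A minimal
sketch of that pattern (all names here are hypothetical, not Git's
actual packfile.c state):

```c
/*
 * Sketch only: file-scope globals become fields of a context struct,
 * and every access to shared state takes a mutex, which is the
 * prerequisite for concurrent pack access. Git's real pack state
 * (pack windows, delta cache, ...) is far more involved.
 */
#include <pthread.h>
#include <stdint.h>

struct pack_ctx {
	pthread_mutex_t lock;
	uint64_t cache_hits;    /* stand-in for real cache bookkeeping */
	uint64_t cache_misses;
};

void pack_ctx_init(struct pack_ctx *ctx)
{
	pthread_mutex_init(&ctx->lock, NULL);
	ctx->cache_hits = 0;
	ctx->cache_misses = 0;
}

/* Would formerly have touched a global counter; now thread-safe. */
void pack_record_access(struct pack_ctx *ctx, int hit)
{
	pthread_mutex_lock(&ctx->lock);
	if (hit)
		ctx->cache_hits++;
	else
		ctx->cache_misses++;
	pthread_mutex_unlock(&ctx->lock);
}

uint64_t pack_total_accesses(struct pack_ctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	uint64_t n = ctx->cache_hits + ctx->cache_misses;
	pthread_mutex_unlock(&ctx->lock);
	return n;
}
```

Callers then thread a `struct pack_ctx *` through instead of relying
on globals; the single mutex could later be split per cache or per
window if contention shows up.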
-- 
Duy


* Re: Questions on GSoC 2019 Ideas
  2019-02-28 22:07 ` Christian Couder
  2019-03-01  9:30   ` Duy Nguyen
@ 2019-03-01 15:20   ` Johannes Schindelin
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Schindelin @ 2019-03-01 15:20 UTC (permalink / raw)
  To: Christian Couder
  Cc: Matheus Tavares Bernardino, git, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

Hi Chris,

On Thu, 28 Feb 2019, Christian Couder wrote:

> On Thu, Feb 28, 2019 at 10:46 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I'm also interested in the 'convert script to builtin' task, and I was
> > wondering if 'git-bisect' would be a good candidate for that.
> 
> There is ongoing work on that by Tanushree Tumane, an Outreachy
> intern. The internship ends March 5, but it seems to me that there is
> not a lot of work left on the project, and hopefully Tanushree or her
> mentors will take care of it.

By "or her mentors" you are referring to yourself and to myself.

And indeed, there won't be much left to do to convert the script to a
full built-in. But there were a couple of things that Tanushree also
had on her radar, such as fixing how git bisect handles many merge
bases.

I think that could still be a Google Summer of Code project.

Ciao,
Johannes

> 
> Best,
> Christian.
> 
> > Thanks,
> > Matheus Tavares
> >
> > [1]: https://public-inbox.org/git/20190226051804.10631-1-matheus.bernardino@usp.br/
> 


* Re: Questions on GSoC 2019 Ideas
  2019-03-01  9:30   ` Duy Nguyen
@ 2019-03-02 15:09     ` Thomas Gummerer
  2019-03-03  7:18       ` Christian Couder
  2019-03-03 10:03       ` Duy Nguyen
  0 siblings, 2 replies; 24+ messages in thread
From: Thomas Gummerer @ 2019-03-02 15:09 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Christian Couder, Matheus Tavares Bernardino, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On 03/01, Duy Nguyen wrote:
> On Fri, Mar 1, 2019 at 5:20 AM Christian Couder
> <christian.couder@gmail.com> wrote:
> >
> > Hi Matheus,
> >
> > On Thu, Feb 28, 2019 at 10:46 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> > >
> > > I've been on the mailing list for a couple of weeks now, mainly
> > > working on my GSoC micro-project[1] and on other patches derived
> > > from it. I have also been contributing to the Linux kernel for half
> > > a year, but am now mainly just supporting other students here at USP.
> > >
> > > I have read the GSoC 2019 ideas page, and many of the ideas interest
> > > me. Also, while looking through Git development materials on the web,
> > > I came across the GSoC 2012 ideas page, where this one caught my
> > > attention:
> > > https://github.com/peff/git/wiki/SoC-2012-Ideas#improving-parallelism-in-various-commands
> > >
> > > I'm interested in parallel computing, which has been my research
> > > topic for about a year now, so I would like to ask about the status
> > > of this GSoC idea. I've read git-grep's code and saw that it is
> > > already parallel, but I was wondering whether parallelism has
> > > already been considered for any other part of Git, in pursuit of
> > > greater performance, and whether that could perhaps be a GSoC
> > > project.
> >
> > I vaguely remember that we thought at one point that all the
> > low-hanging fruit in this area had already been picked, but I might
> > be wrong.
> 
> We still have to remove some global variables, which is quite easy to
> do, before one could actually add mutexes and such to allow
> concurrent pack access. I don't know, though, whether removing the
> global variables is exciting enough for GSoC, or whether both tasks
> could fit in one GSoC. Adding the parallel access is not that hard, I
> think, once you know packfile.c and sha1-file.c relatively well. It's
> mostly a matter of dealing safely with the caches and all the sliding
> access windows.

I'm not very familiar with what's required here, but reading the above
makes me think it's likely too much for a GSoC project.  I think I'd
be happy with a project that declares removing the global variables as
the main goal, and adding parallelism as a potential bonus.

I'm a bit wary of too large a proposal here, as we've historically
overestimated what kind of project is achievable over a summer (I've
been there myself, as my GSoC project was also more than I was able to
do in a summer :)).  I'd rather have a project whose goal is rather
small and can be expanded later than something that could potentially
take more than 3 months, where the student (or their mentors) would
have to finish it after GSoC.

> -- 
> Duy


* Re: Questions on GSoC 2019 Ideas
  2019-03-02 15:09     ` Thomas Gummerer
@ 2019-03-03  7:18       ` Christian Couder
  2019-03-03 10:12         ` Duy Nguyen
  2019-03-05 23:03         ` Matheus Tavares Bernardino
  2019-03-03 10:03       ` Duy Nguyen
  1 sibling, 2 replies; 24+ messages in thread
From: Christian Couder @ 2019-03-03  7:18 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Duy Nguyen, Matheus Tavares Bernardino, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Sat, Mar 2, 2019 at 4:09 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> On 03/01, Duy Nguyen wrote:
> > On Fri, Mar 1, 2019 at 5:20 AM Christian Couder
> > <christian.couder@gmail.com> wrote:
> > >
> > > Hi Matheus,
> > >
> > > On Thu, Feb 28, 2019 at 10:46 PM Matheus Tavares Bernardino
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > I've been on the mailing list for a couple of weeks now, mainly
> > > > working on my GSoC micro-project[1] and on other patches derived
> > > > from it. I have also been contributing to the Linux kernel for half
> > > > a year, but am now mainly just supporting other students here at USP.
> > > >
> > > > I have read the GSoC 2019 ideas page, and many of the ideas interest
> > > > me. Also, while looking through Git development materials on the web,
> > > > I came across the GSoC 2012 ideas page, where this one caught my
> > > > attention:
> > > > https://github.com/peff/git/wiki/SoC-2012-Ideas#improving-parallelism-in-various-commands
> > > >
> > > > I'm interested in parallel computing, which has been my research
> > > > topic for about a year now, so I would like to ask about the status
> > > > of this GSoC idea. I've read git-grep's code and saw that it is
> > > > already parallel, but I was wondering whether parallelism has
> > > > already been considered for any other part of Git, in pursuit of
> > > > greater performance, and whether that could perhaps be a GSoC
> > > > project.
> > >
> > > I vaguely remember that we thought at one point that all the
> > > low-hanging fruit in this area had already been picked, but I might
> > > be wrong.
> >
> > We still have to remove some global variables, which is quite easy to
> > do, before one could actually add mutexes and such to allow
> > concurrent pack access. I don't know, though, whether removing the
> > global variables is exciting enough for GSoC, or whether both tasks
> > could fit in one GSoC. Adding the parallel access is not that hard, I
> > think, once you know packfile.c and sha1-file.c relatively well. It's
> > mostly a matter of dealing safely with the caches and all the sliding
> > access windows.
>
> I'm not very familiar with what's required here, but reading the above
> makes me think it's likely too much for a GSoC project.  I think I'd
> be happy with a project that declares removing the global variables as
> the main goal, and adding parallelism as a potential bonus.

Yeah, I think that the main issue, now that Duy has found something
that could be a GSoC project, is that the potential mentors are not
familiar with the pack access code. It means that Matheus would
probably not get a lot of help from his mentors while working on
adding parallelism.

That may not be too big a problem, though, if Matheus is OK with
asking many technical questions on the mailing list. It seems to me
that he could succeed.

> I'm a bit wary of too large a proposal here, as we've historically
> overestimated what kind of project is achievable over a summer (I've
> been there myself, as my GSoC project was also more than I was able to
> do in a summer :)).  I'd rather have a project whose goal is rather
> small and can be expanded later than something that could potentially
> take more than 3 months, where the student (or their mentors) would
> have to finish it after GSoC.

Yeah, I agree with your suggestion about a project that declares
removing the global variables as the main goal, and adding parallelism
as a potential bonus.

One thing I am still worried about is whether adding parallelism is
likely to get us a significant performance improvement. If the
performance of this code is bounded by disk or memory access, then
adding parallelism might not bring any benefit. (It could perhaps
decrease performance if memory locality gets worse.) So I'd like some
confirmation, either from running some tests or from experienced Git
developers, that it is likely to be a win.


* Re: Questions on GSoC 2019 Ideas
  2019-03-02 15:09     ` Thomas Gummerer
  2019-03-03  7:18       ` Christian Couder
@ 2019-03-03 10:03       ` Duy Nguyen
  2019-03-03 16:12         ` Thomas Gummerer
  1 sibling, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2019-03-03 10:03 UTC (permalink / raw)
  To: Thomas Gummerer
  Cc: Christian Couder, Matheus Tavares Bernardino, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Sat, Mar 2, 2019 at 10:09 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
>
> On 03/01, Duy Nguyen wrote:
> > On Fri, Mar 1, 2019 at 5:20 AM Christian Couder
> > <christian.couder@gmail.com> wrote:
> > >
> > > Hi Matheus,
> > >
> > > On Thu, Feb 28, 2019 at 10:46 PM Matheus Tavares Bernardino
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > I've been on the mailing list for a couple of weeks now, mainly
> > > > working on my GSoC micro-project[1] and on other patches derived
> > > > from it. I have also been contributing to the Linux kernel for half
> > > > a year, but am now mainly just supporting other students here at USP.
> > > >
> > > > I have read the GSoC 2019 ideas page, and many of the ideas interest
> > > > me. Also, while looking through Git development materials on the web,
> > > > I came across the GSoC 2012 ideas page, where this one caught my
> > > > attention:
> > > > https://github.com/peff/git/wiki/SoC-2012-Ideas#improving-parallelism-in-various-commands
> > > >
> > > > I'm interested in parallel computing, which has been my research
> > > > topic for about a year now, so I would like to ask about the status
> > > > of this GSoC idea. I've read git-grep's code and saw that it is
> > > > already parallel, but I was wondering whether parallelism has
> > > > already been considered for any other part of Git, in pursuit of
> > > > greater performance, and whether that could perhaps be a GSoC
> > > > project.
> > >
> > > I vaguely remember that we thought at one point that all the
> > > low-hanging fruit in this area had already been picked, but I might
> > > be wrong.
> >
> > We still have to remove some global variables, which is quite easy to
> > do, before one could actually add mutexes and such to allow
> > concurrent pack access. I don't know, though, whether removing the
> > global variables is exciting enough for GSoC, or whether both tasks
> > could fit in one GSoC. Adding the parallel access is not that hard, I
> > think, once you know packfile.c and sha1-file.c relatively well. It's
> > mostly a matter of dealing safely with the caches and all the sliding
> > access windows.
>
> I'm not very familiar with what's required here, but reading the above
> makes me think it's likely too much for a GSoC project.  I think I'd
> be happy with a project that declares removing the global variables as
> the main goal, and adding parallelism as a potential bonus.
>
> I'm a bit wary of too large a proposal here, as we've historically
> overestimated what kind of project is achievable over a summer (I've
> been there myself, as my GSoC project was also more than I was able to
> do in a summer :)).  I'd rather have a project whose goal is rather
> small and can be expanded later than something that could potentially
> take more than 3 months, where the student (or their mentors) would
> have to finish it after GSoC.

This is why I'm not involved in GSoC. I often misestimate the size of
work (and yes, I would still like your tree-based index format in; I
can't remember why it never made it).

So yeah, if you find removing the global variables (which essentially
identifies the shared state, a prerequisite for any parallel work)
reasonable for GSoC, I'd say go for it.

Also be aware that this kind of refactoring work can result in lots
of patches, and it takes time to get them merged, if getting merged
is your GSoC goal.
-- 
Duy


* Re: Questions on GSoC 2019 Ideas
  2019-03-03  7:18       ` Christian Couder
@ 2019-03-03 10:12         ` Duy Nguyen
  2019-03-03 10:17           ` Duy Nguyen
  2019-03-05  4:51           ` Jeff King
  2019-03-05 23:03         ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 24+ messages in thread
From: Duy Nguyen @ 2019-03-03 10:12 UTC (permalink / raw)
  To: Christian Couder
  Cc: Thomas Gummerer, Matheus Tavares Bernardino, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, Jeff King

On Sun, Mar 3, 2019 at 2:18 PM Christian Couder
<christian.couder@gmail.com> wrote:
> One thing I am still worried about is whether adding parallelism is
> likely to get us a significant performance improvement. If the
> performance of this code is bounded by disk or memory access, then
> adding parallelism might not bring any benefit. (It could perhaps
> decrease performance if memory locality gets worse.) So I'd like some
> confirmation, either from running some tests or from experienced Git
> developers, that it is likely to be a win.

This is a good point. My guess is that pack access consists of two
parts: decompression (deflate zlib and resolve delta objects, which
is just another form of compression) and the actual I/O. The former
is CPU-bound and may take advantage of multiple cores. However, the
cache we have already helps reduce the CPU workload, so perhaps the
actual gain is not that much (or maybe we could just make this cache
more efficient). I'm adding Jeff; maybe he has done some experiments
on parallel pack access, who knows.

The second good thing about parallel pack access is not about
utilizing processing power from multiple cores, but about _not_
blocking. I think one example use case here is parallel checkout:
while one thread is blocked by the pack access code for whatever
reason, the others can still continue doing other things (e.g.
writing the checked-out file to disk) or even access the pack again
to check more things out.
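The non-blocking benefit above can be sketched roughly like this
(toy code under assumed names, not Git's actual parallel-checkout
implementation): worker threads pull entries from a shared list, the
pack access itself stays serialized behind one mutex, and writing the
result out happens outside the lock, so a thread stuck in pack access
does not keep the others from finishing their writes.

```c
/*
 * Toy model of the parallel-checkout idea; the "pack" and all names
 * are made up. Only the pack access is serialized; the write-out
 * step runs with no lock held.
 */
#include <pthread.h>

enum { NENTRIES = 64, NTHREADS = 4 };

static pthread_mutex_t pack_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_entry;
static int written[NENTRIES];

/* stand-in for zlib inflation plus delta resolution */
static int read_object_from_pack(int entry)
{
	return entry * 2;
}

static void *checkout_worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&queue_lock);
		int entry = next_entry < NENTRIES ? next_entry++ : -1;
		pthread_mutex_unlock(&queue_lock);
		if (entry < 0)
			break;

		pthread_mutex_lock(&pack_lock);   /* serialized pack access */
		int data = read_object_from_pack(entry);
		pthread_mutex_unlock(&pack_lock);

		written[entry] = data;            /* "write to disk", unlocked */
	}
	return NULL;
}

/* Runs the workers; returns 1 if every entry was checked out correctly. */
int checkout_all(void)
{
	pthread_t t[NTHREADS];
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, checkout_worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NENTRIES; i++)
		if (written[i] != i * 2)
			return 0;
	return 1;
}
```

With real file I/O in the unlocked section, a thread waiting on the
pack no longer serializes the whole checkout.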
-- 
Duy


* Re: Questions on GSoC 2019 Ideas
  2019-03-03 10:12         ` Duy Nguyen
@ 2019-03-03 10:17           ` Duy Nguyen
  2019-03-05  4:51           ` Jeff King
  1 sibling, 0 replies; 24+ messages in thread
From: Duy Nguyen @ 2019-03-03 10:17 UTC (permalink / raw)
  To: Christian Couder
  Cc: Thomas Gummerer, Matheus Tavares Bernardino, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, Jeff King

On Sun, Mar 3, 2019 at 5:12 PM Duy Nguyen <pclouds@gmail.com> wrote:
> This is a good point. My guess is that pack access consists of two
> parts: decompression (deflate zlib and resolve delta objects, which

s/deflate/inflate/ (I keep making this mistake)
-- 
Duy


* Re: Questions on GSoC 2019 Ideas
  2019-03-03 10:03       ` Duy Nguyen
@ 2019-03-03 16:12         ` Thomas Gummerer
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gummerer @ 2019-03-03 16:12 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Christian Couder, Matheus Tavares Bernardino, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On 03/03, Duy Nguyen wrote:
> On Sat, Mar 2, 2019 at 10:09 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > I'm not very familiar with what's required here, but reading the above
> > makes me think it's likely too much for a GSoC project.  I think I'd
> > be happy with a project that declares removing the global variables as
> > the main goal, and adding parallelism as a potential bonus.
> >
> > I'm a bit wary of too large a proposal here, as we've historically
> > overestimated what kind of project is achievable over a summer (I've
> > been there myself, as my GSoC project was also more than I was able to
> > do in a summer :)).  I'd rather have a project whose goal is rather
> > small and can be expanded later than something that could potentially
> > take more than 3 months, where the student (or their mentors) would
> > have to finish it after GSoC.
> 
> This is why I'm not involved in GSoC. I often misestimate the size
> of work (and yes, I would still like your tree-based index format
> in; I can't remember why it never made it).

So do I, and that's why I'd like to err on the side of having smaller
projects :)

I think the main reason the tree-based index format never made it is
that the in-core APIs were not set up to make use of the new index
format.  I'm also still interested in getting it in, but I haven't
found the time to look at making the index code pluggable yet.  It
would probably take a refactoring similar to the one done for the
refs code to get this in.

All that said, GSoC was still a great experience for me, and I got to
learn a ton over the summer.  But I did feel like I let down the
people who invested a lot of time in the project, by not being able
to finish it.  And the feeling of accomplishment from actually
finishing a project would definitely have been nice to have as well.
So for those reasons I think it would be better for students to take
on smaller projects.

> So yeah, if you find removing the global variables (which
> essentially identifies the shared state, a prerequisite for any
> parallel work) reasonable for GSoC, I'd say go for it.
>
> Also be aware that this kind of refactoring work can result in lots
> of patches, and it takes time to get them merged, if getting merged
> is your GSoC goal.
> -- 
> Duy


* Re: Questions on GSoC 2019 Ideas
  2019-03-03 10:12         ` Duy Nguyen
  2019-03-03 10:17           ` Duy Nguyen
@ 2019-03-05  4:51           ` Jeff King
  2019-03-05 12:57             ` Duy Nguyen
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff King @ 2019-03-05  4:51 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Christian Couder, Thomas Gummerer, Matheus Tavares Bernardino,
	git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Sun, Mar 03, 2019 at 05:12:59PM +0700, Duy Nguyen wrote:

> On Sun, Mar 3, 2019 at 2:18 PM Christian Couder
> <christian.couder@gmail.com> wrote:
> > One thing I am still worried about is whether adding parallelism is
> > likely to get us a significant performance improvement. If the
> > performance of this code is bounded by disk or memory access, then
> > adding parallelism might not bring any benefit. (It could perhaps
> > decrease performance if memory locality gets worse.) So I'd like some
> > confirmation, either from running some tests or from experienced Git
> > developers, that it is likely to be a win.
> 
> This is a good point. My guess is that pack access consists of two
> parts: decompression (deflate zlib and resolve delta objects, which
> is just another form of compression) and the actual I/O. The former
> is CPU-bound and may take advantage of multiple cores. However, the
> cache we have already helps reduce the CPU workload, so perhaps the
> actual gain is not that much (or maybe we could just make this cache
> more efficient). I'm adding Jeff; maybe he has done some experiments
> on parallel pack access, who knows.

Sorry, I don't have anything intelligent to add here. I do know that
`index-pack` doesn't scale well with more cores. I don't think I've ever
looked at adding parallel access to the packs themselves. I suspect it
would be tricky due to a few global variables (the pack windows, the
delta cache, etc).

> The second good thing about parallel pack access is not about
> utilizing processing power from multiple cores, but about _not_
> blocking. I think one example use case here is parallel checkout:
> while one thread is blocked by the pack access code for whatever
> reason, the others can still continue doing other things (e.g.
> writing the checked-out file to disk) or even access the pack again
> to check more things out.

I'm not sure if it would help much for packs, because they're organized
to have pretty good cold-cache read-ahead behavior. But who knows until
we measure it.

I do suspect that inflating (and delta reconstruction) done in parallel
could be a win for git-grep, especially if you have a really simple
regex that is quick to search.

-Peff


* Re: Questions on GSoC 2019 Ideas
  2019-03-05  4:51           ` Jeff King
@ 2019-03-05 12:57             ` Duy Nguyen
  2019-03-05 23:46               ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2019-03-05 12:57 UTC (permalink / raw)
  To: Jeff King
  Cc: Christian Couder, Thomas Gummerer, Matheus Tavares Bernardino,
	git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Tue, Mar 5, 2019 at 11:51 AM Jeff King <peff@peff.net> wrote:
> > processing power from multiple cores, but about _not_ blocking. I
> > think one example use case here is parallel checkout: while one
> > thread is blocked by the pack access code for whatever reason, the
> > others can still continue doing other things (e.g. writing the
> > checked-out file to disk) or even access the pack again to check
> > more things out.
>
> I'm not sure if it would help much for packs, because they're organized
> to have pretty good cold-cache read-ahead behavior. But who knows until
> we measure it.
>
> I do suspect that inflating (and delta reconstruction) done in parallel
> could be a win for git-grep, especially if you have a really simple
> regex that is quick to search.

Maybe git-blame too. But this is based purely on my watching the CPU
utilization of one command with a hot cache. For git-blame, though,
the diff code has to be thread-safe too, but that's another story.
-- 
Duy


* Re: Questions on GSoC 2019 Ideas
  2019-03-03  7:18       ` Christian Couder
  2019-03-03 10:12         ` Duy Nguyen
@ 2019-03-05 23:03         ` Matheus Tavares Bernardino
  2019-03-06 23:17           ` Thomas Gummerer
  1 sibling, 1 reply; 24+ messages in thread
From: Matheus Tavares Bernardino @ 2019-03-05 23:03 UTC (permalink / raw)
  To: Christian Couder
  Cc: Thomas Gummerer, Duy Nguyen, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

First of all, I must apologize for not replying these last few days.
I'm traveling and rarely get a connection here, but I'll be back on
March 11th.

On Sun, Mar 3, 2019 at 4:18 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Sat, Mar 2, 2019 at 4:09 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> >
> > I'm a bit wary of too large a proposal here, as we've historically
> > overestimated what kind of project is achievable over a summer (I've
> > been there myself, as my GSoC project was also more than I was able to
> > do in a summer :)).  I'd rather have a project whose goal is rather
> > small and can be expanded later than something that could potentially
> > take more than 3 months, where the student (or their mentors) would
> > have to finish it after GSoC.
>

I totally understand the concern.

> Yeah, I agree with your suggestion about a project that declares
> removing the global variables as the main goal, and adding parallelism
> as a potential bonus.
>

Speaking of a well-delimited scope for GSoC with a potential bonus
afterwards, an idea comes to mind: I'm still trying to define the
subject of my undergraduate thesis (which must be in HPC and/or
parallelism on CPU/GPU), and the idea of bringing more parallelism to
Git seems too big for a GSoC project. So perhaps, if we manage to
identify whether parallelism would indeed bring a good performance
gain to Git, I could propose that to my advisor as my undergraduate
thesis and work on it during this whole year. It is still an idea to
be matured, but do you think it would be feasible?


* Re: Questions on GSoC 2019 Ideas
  2019-03-05 12:57             ` Duy Nguyen
@ 2019-03-05 23:46               ` Matheus Tavares Bernardino
  2019-03-06 10:17                 ` Duy Nguyen
  0 siblings, 1 reply; 24+ messages in thread
From: Matheus Tavares Bernardino @ 2019-03-05 23:46 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Jeff King, Christian Couder, Thomas Gummerer, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

First of all, this exercise of estimating a good spot to gain
performance from parallelism in Git seems more difficult than I
thought. Also, I'm not that familiar yet with Git's packing machinery
(nor with the sections of it that could benefit from parallelism). So
could anyone point me to some good references on this, so I can study
them and maybe come back with more valuable suggestions?

On Tue, Mar 5, 2019 at 9:57 AM Duy Nguyen <pclouds@gmail.com> wrote:
>
> On Tue, Mar 5, 2019 at 11:51 AM Jeff King <peff@peff.net> wrote:
> > > processing power from multiple cores, but about _not_ blocking. I
> > > think one example use case here is parallel checkout. While one thread
> > > is blocked by pack access code for whatever reason, the others can
> > > still continue doing other stuff (e.g. write the checked out file to
> > > disk) or even access the pack again to check more things out.
> >

Hmm, you mean distributing the process of inflating, reconstructing
deltas and checking out files between the threads? (having each one
doing the process for a different file?)

> > I'm not sure if it would help much for packs, because they're organized
> > to have pretty good cold-cache read-ahead behavior. But who knows until
> > we measure it.
> >
> > I do suspect that inflating (and delta reconstruction) done in parallel
> > could be a win for git-grep, especially if you have a really simple
> > regex that is quick to search.
>
> Maybe git-blame too. But this is based purely on me watching CPU
> utilization of one command with hot cache. For git-blame though, diff
> code has to be thread safe too, but that's another story.

I don't know if this relates to parallelizing pack access, but I
thought that sharing this with you all could perhaps bring some new
insights (maybe even on parallelizing some other git section): I asked
my friends who contribute to the Linux kernel which git commands seem
to take longest during their kernel work, and the answers were:
- git log and git status, sometimes
- using pager's search at git log
- checking out to an old commit
- git log --oneline --decorate --graph

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-03-05 23:46               ` Matheus Tavares Bernardino
@ 2019-03-06 10:17                 ` Duy Nguyen
  2019-03-12  0:18                   ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2019-03-06 10:17 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Jeff King, Christian Couder, Thomas Gummerer, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Wed, Mar 6, 2019 at 6:47 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> This exercise of estimating a good spot to gain performance with
> parallelism in git seems more difficult than I first thought. Also,
> I'm not that familiar yet with git packing (nor with the sections
> of it that could benefit from parallelism). So could anyone point me
> to some good references on this, which I could study and maybe come
> back with more valuable suggestions?

I think you should skim through
Documentation/technical/pack-format.txt to have an idea what we're
talking about here (inflation, delta cache...).

The short (and slightly inaccurate) version is, in order to give the
content of a SHA-1, we need to

1. locate that "object" in the pack file (I'm ignoring loose objects)

2. assume (worst case) that this object is a delta object: it contains
only modifications of another object, so you would need to get the
content of that other object first, then apply these modifications ("the
delta") to get the content. This could repeat many times

3. once you get full content, it's actually zlib compressed, so you
need to uncompress/inflate it.

Step 2 would be super slow if you have to uncompress that "another
object" every single time. So there's a "delta cache" which stores the
content of these "other objects". Next time you see a delta object on
top of one of these in the cache, you can just apply the delta and be
done with it. Much faster. This delta cache is of course global and
not thread safe.

Step 3 can also be slow when dealing with large blobs.

Another global state that Jeff mentioned is pack windows, which I think
are a bit harder to explain quickly. But basically we have many "windows"
to see the raw content of a pack file; these windows are global (per
pack, actually) and are also not thread safe.

So all these steps are not thread safe. When a command with multiple
thread support accesses the pack, all those steps are protected by a
single mutex (it's grep_mutex in builtin/grep.c or read_lock() in
builtin/pack-objects.c). As you can see, the steps here are CPU-bound (step
3 is obvious, step 2 will have to inflate the other objects), so if
you use a more fine-grained mutex, chances are the inflation step can
be done in parallel.

I think the good spots to gain performance are the commands that
already have multiple thread support. I mentioned git-grep and
git-pack-objects.c above. git-index-pack is the third one but it's
special and I think does not use general pack access code.

I think if we could somehow measure lock contention of those big locks
above we can guesstimate how much gain there is. If threads in
git-grep for example often have to wait for grep_mutex, then in the
best case scenario when you make pack access thread safe, lock
contention goes down to near zero.

> On Tue, Mar 5, 2019 at 9:57 AM Duy Nguyen <pclouds@gmail.com> wrote:
> >
> > On Tue, Mar 5, 2019 at 11:51 AM Jeff King <peff@peff.net> wrote:
> > > > processing power from multiple cores, but about _not_ blocking. I
> > > > think one example use case here is parallel checkout. While one thread
> > > > is blocked by pack access code for whatever reason, the others can
> > > > still continue doing other stuff (e.g. write the checked out file to
> > > > disk) or even access the pack again to check more things out.
> > >
>
> Hmm, you mean distributing the process of inflating, reconstructing
> deltas and checking out files between the threads? (having each one
> doing the process for a different file?)

Yes. So if one thread hits a giant file (and spends a lot of time
inflating), the other threads can still go on inflating smaller files
and writing them to disk.

> > > I'm not sure if it would help much for packs, because they're organized
> > > to have pretty good cold-cache read-ahead behavior. But who knows until
> > > we measure it.
> > >
> > > I do suspect that inflating (and delta reconstruction) done in parallel
> > > could be a win for git-grep, especially if you have a really simple
> > > regex that is quick to search.
> >
> > Maybe git-blame too. But this is based purely on me watching CPU
> > utilization of one command with hot cache. For git-blame though, diff
> > code has to be thread safe too, but that's another story.
>
> I don't know if this relates to parallelizing pack access, but I
> thought that sharing this with you all could perhaps bring some new
> insights (maybe even on parallelizing some other git section): I asked
> my friends who contribute to the Linux kernel which git commands seem
> to take longest during their kernel work, and the answers were:
> - git log and git status, sometimes
> - using pager's search at git log
> - checking out to an old commit
> - git log --oneline --decorate --graph

Sometimes the slowness is not because of serialized pack access. I'm
sure git-status is not that much impacted by slow pack access.

The pager search case may be sped up (git-log has to look at lots of
blobs and trees) but you also need to parallelize diff code as well
and that I think is way too big to consider.

Checking out an old commit, I think, could also be sped up with
parallel checkout. Again this is really out of scope. But of course it
can't be done until pack access is thread safe. Or it can be done, but
not as prettily [1]. [1] suggests a 30% speedup. That's the share-nothing
best possible case, I think. But that code is in no way optimized, so the
real speedup could be higher.

[1] https://public-inbox.org/git/20160415095139.GA3985@lanh/
-- 
Duy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-03-05 23:03         ` Matheus Tavares Bernardino
@ 2019-03-06 23:17           ` Thomas Gummerer
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Gummerer @ 2019-03-06 23:17 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Christian Couder, Duy Nguyen, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On 03/05, Matheus Tavares Bernardino wrote:
> First of all, I must apologize for not replying during these last
> days. I'm traveling and I rarely get a connection here. But I'll be
> back March 11th.
> 
> On Sun, Mar 3, 2019 at 4:18 AM Christian Couder
> <christian.couder@gmail.com> wrote:
> >
> > On Sat, Mar 2, 2019 at 4:09 PM Thomas Gummerer <t.gummerer@gmail.com> wrote:
> > >
> > > I'm a bit wary of a too large proposal here, as we've historically
> > > overestimated what kind of project is achievable over a summer (I've
> > > been there myself, as my GSoC project was also more than I was able to
> > > do in a summer :)).  I'd rather have a project whose goal is rather
> > > small and can be expanded later, than having something that could
> > > potentially take more than 3 months, where the student (or their
> > > mentors) have to finish it after GSoC.
> >
> 
> I totally understand the concern.
> 
> > Yeah, I agree with your suggestion about a project that declares
> > removing the global variables as the main goal, and adding parallelism
> > as a potential bonus.
> >
> 
> Talking about a delimited scope for GSoC and a potential bonus
> afterwards, an idea comes to my mind: I'm still trying to define the
> subject for my undergraduate thesis (which must be in HPC and/or
> parallelism on CPU/GPU), and the idea of bringing more parallelism to
> git seems too big for a GSoC project. So, perhaps, if we manage to
> identify whether parallelism would indeed bring a good performance
> gain to git, I could propose it to my thesis advisor and work on it
> during this whole year. It is still an idea to be matured, but do you
> think it would be feasible?

I think this idea is generally fine, but your project proposal should
only be about the parts that are going to be included in your GSoC
project.  Of course we love to see GSoC students that are still
actively participating in the project after GSoC.  However that should
not be part of the sponsored program.  Mentors may also be less
available outside of the program, as mentoring can take quite some
time commitment.

Another slight concern here is that the mentors may be less familiar
with this project than others (I for one am not very familiar, and
Christian mentioned similar concerns), so it may be harder to give you
good advice.

All that said, if you are still excited about this project, think
about how to address these concerns in your proposal.  And feel free
to discuss the proposal on the mailing list as well, before actually
submitting it.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-03-06 10:17                 ` Duy Nguyen
@ 2019-03-12  0:18                   ` Matheus Tavares Bernardino
  2019-03-12 10:02                     ` Duy Nguyen
  0 siblings, 1 reply; 24+ messages in thread
From: Matheus Tavares Bernardino @ 2019-03-12  0:18 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Jeff King, Christian Couder, Thomas Gummerer, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Wed, Mar 6, 2019 at 7:17 AM Duy Nguyen <pclouds@gmail.com> wrote:
>
> On Wed, Mar 6, 2019 at 6:47 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > This exercise of estimating a good spot to gain performance with
> > parallelism in git seems more difficult than I first thought. Also,
> > I'm not that familiar yet with git packing (nor with the sections
> > of it that could benefit from parallelism). So could anyone point me
> > to some good references on this, which I could study and maybe come
> > back with more valuable suggestions?
>
> I think you should skim through
> Documentation/technical/pack-format.txt to have an idea what we're
> talking about here (inflation, delta cache...).

Thanks for the recommendation. It was a very explanatory reading.

> The short (and slightly inaccurate) version is, in order to give the
> content of a SHA-1, we need to
>
> 1. locate that "object" in the pack file (I'm ignoring loose objects)
>
> 2. assume (worst case) that this object is a delta object: it contains
> only modifications of another object, so you would need to get the
> content of that other object first, then apply these modifications ("the
> delta") to get the content. This could repeat many times
>
> 3. once you get full content, it's actually zlib compressed, so you
> need to uncompress/inflate it.
>
> Step 2 would be super slow if you have to uncompress that "another
> object" every single time. So there's a "delta cache" which stores the
> content of these "other objects". Next time you see a delta object on
> top of one of these in the cache, you can just apply the delta and be
> done with it. Much faster. This delta cache is of course global and
> not thread safe.
>
> Step 3 can also be slow when dealing with large blobs.

Thanks for the explanation.

> Another global state that Jeff mentioned is pack windows, which I think
> are a bit harder to explain quickly. But basically we have many "windows"
> to see the raw content of a pack file; these windows are global (per
> pack, actually) and are also not thread safe.
>
> So all these steps are not thread safe. When a command with multiple
> thread support accesses the pack, all those steps are protected by a
> single mutex (it's grep_mutex in builtin/grep.c or read_lock() in
> builtin/pack-objects.c). As you can see, the steps here are CPU-bound (step
> 3 is obvious, step 2 will have to inflate the other objects), so if
> you use a more fine-grained mutex, chances are the inflation step can
> be done in parallel.
>
> I think the good spots to gain performance are the commands that
> already have multiple thread support. I mentioned git-grep and
> git-pack-objects.c above. git-index-pack is the third one but it's
> special and I think does not use general pack access code.
>
> I think if we could somehow measure lock contention of those big locks
> above we can guesstimate how much gain there is. If threads in
> git-grep for example often have to wait for grep_mutex, then in the
> best case scenario when you make pack access thread safe, lock
> contention goes down to near zero.

I've been thinking about how I could implement a test to estimate the
lock contention but have had no success so far. I wanted to try
mutrace[2] but couldn't install it; I tried valgrind's drd but it
didn't seem to report a contention time estimation; and I tried
measuring the time spent in "pthread_mutex_lock(&grep_mutex)", but I
don't know how significant this value is, since we can't directly
compare it (the time many threads spend at the lock) with the overall
execution time. Do you have an idea of how we could measure the lock
contention here?

Another thing that is occurring to me right now is whether git-grep,
as it is implemented today, would really benefit from thread-safe pack
access. I may have to study the code more, but it seems to me that
just the producer thread uses pack access.

[2] http://0pointer.de/blog/projects/mutrace.html

> > On Tue, Mar 5, 2019 at 9:57 AM Duy Nguyen <pclouds@gmail.com> wrote:
> > >
> > > On Tue, Mar 5, 2019 at 11:51 AM Jeff King <peff@peff.net> wrote:
> > > > > processing power from multiple cores, but about _not_ blocking. I
> > > > > think one example use case here is parallel checkout. While one thread
> > > > > is blocked by pack access code for whatever reason, the others can
> > > > > still continue doing other stuff (e.g. write the checked out file to
> > > > > disk) or even access the pack again to check more things out.
> > > >
> >
> > Hmm, you mean distributing the process of inflating, reconstructing
> > deltas and checking out files between the threads? (having each one
> > doing the process for a different file?)
>
> Yes. So if one thread hits a giant file (and spends a lot of time
> inflating), the other threads can still go on inflating smaller files
> and writing them to disk.
>
> > > > I'm not sure if it would help much for packs, because they're organized
> > > > to have pretty good cold-cache read-ahead behavior. But who knows until
> > > > we measure it.
> > > >
> > > > I do suspect that inflating (and delta reconstruction) done in parallel
> > > > could be a win for git-grep, especially if you have a really simple
> > > > regex that is quick to search.
> > >
> > > Maybe git-blame too. But this is based purely on me watching CPU
> > > utilization of one command with hot cache. For git-blame though, diff
> > > code has to be thread safe too, but that's another story.
> >
> > I don't know if this relates to parallelizing pack access, but I
> > thought that sharing this with you all could perhaps bring some new
> > insights (maybe even on parallelizing some other git section): I asked
> > my friends who contribute to the Linux kernel which git commands seem
> > to take longest during their kernel work, and the answers were:
> > - git log and git status, sometimes
> > - using pager's search at git log
> > - checking out to an old commit
> > - git log --oneline --decorate --graph
>
> Sometimes the slowness is not because of serialized pack access. I'm
> sure git-status is not that much impacted by slow pack access.
>
> The pager search case may be sped up (git-log has to look at lots of
> blobs and trees) but you also need to parallelize diff code as well
> and that I think is way too big to consider.
>
> Checking out an old commit, I think, could also be sped up with
> parallel checkout. Again this is really out of scope. But of course it
> can't be done until pack access is thread safe. Or it can be done, but
> not as prettily [1]. [1] suggests a 30% speedup. That's the share-nothing
> best possible case, I think. But that code is in no way optimized, so the
> real speedup could be higher.
>
> [1] https://public-inbox.org/git/20160415095139.GA3985@lanh/

So, although it would be out of scope for GSoC, checkout, diff and log
(and maybe others) could all benefit from a thread-safe/parallel pack
access, right? If so, it is very motivating the impact this project
could, in theory, have :)

> --
> Duy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-03-12  0:18                   ` Matheus Tavares Bernardino
@ 2019-03-12 10:02                     ` Duy Nguyen
  2019-03-12 10:11                       ` Duy Nguyen
  0 siblings, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2019-03-12 10:02 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Jeff King, Christian Couder, Thomas Gummerer, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Mon, Mar 11, 2019 at 09:18:47PM -0300, Matheus Tavares Bernardino wrote:
> I've been thinking about how I could implement a test to estimate the
> lock contention but have had no success so far. I wanted to try
> mutrace[2] but couldn't install it; I tried valgrind's drd but it
> didn't seem to report a contention time estimation; and I tried
> measuring the time spent in "pthread_mutex_lock(&grep_mutex)", but I
> don't know how significant this value is, since we can't directly
> compare it (the time many threads spend at the lock) with the overall
> execution time. Do you have an idea of how we could measure the lock
> contention here?

(I'm supposed to be doing something else, but this is more fun :D)

Yeah lock contention is probably hard to measure. But at least we can
measure how much time is blocked by mutex. Something like this [1]
gives me

    $ time ~/w/git/temp/git grep --threads=8 abc HEAD >/dev/null
    warning: let's have some fun
    block_time = 20ms
    block_count = 10725
    
    real    0m0,379s
    user    0m0,425s
    sys     0m0,073s

From this I know "git grep" took 379ms and at least 20ms (probably
including measurement overhead) is wasted on pthread_mutex_lock(). It
does not look that significant, I admit.

> Another thing that is occurring to me right now is whether git-grep,
> as it is implemented today, would really benefit from thread-safe pack
> access. I may have to study the code more, but it seems to me that
> just the producer thread uses pack access.

That producer I think is just handing out assignments to worker
threads. The real work is still done by worker threads.

The entry point to pack access in this code is protected by
grep_read_lock(). I believe the multithread one is in this deep call
chain (I found it with gdb):

[main thread] grep_oid() -> add_work() ->
[worker thread] grep_source() -> grep_source_1() ->
grep_source_is_binary() -> grep_source_load() ->
grep_source_load_oid() -> read_object_file()

Note that there's another source of pack access, the
lock_and_read_oid_file() in grep_tree(). This is where we unpack tree
objects and traverse to get blob SHA-1.

This code is currently on the main thread (maybe this is what you
found?), so it does not really benefit from multithreading. We could
still add some sort of lookahead queue to inflate tree objects in
advance in parallel, but I don't know how much gain that will be.

One thing I didn't notice before is that we currently force no threads
in the cases where we need pack access, e.g. "git grep <regex> <commit>"
or "git grep --cached <regex>". So if you want to experiment, you need
to hack it and remove that restriction. That's the "let's have some fun"
code in [1].

So, assuming this is CPU bottleneck, let's have a look at how CPU is used.

    perf record git grep --threads=1 abc HEAD >/dev/null
    perf report

shows me the top CPU consumers are

  51,16%  git      libz.so.1.2.11      [.] inflate_fast
  19,55%  git      git                 [.] bmexec

You need to do some guessing here. But I believe the top call is from
inflating objects (either from packs or from loose objects). The
second one is from the regular expression engine.

Since the regex here is very short (I deliberately tried Jeff's case
where object access dominates), the CPU is mostly used up by object
inflation. That suggests that if we could spread it out across cores,
the gain could be quite good. There aren't many dependencies between
grep tasks, so if we spread the workload across 8 cores, execution time
should be reduced by 7 or 8 times.

For fun, I hacked up a horrible, horrible "thread safe" version [2]
that probably broke 99% of git and made helgrind extremely unhappy, so
these numbers are at best guidelines, but..

    $ time ./git grep --threads=1 abc HEAD >/dev/null
    
    real    0m0,253s
    user    0m0,225s
    sys     0m0,029s
    
    $ time ./git grep --threads=8 abc HEAD >/dev/null
    warning: let's have some fun!
    
    real    0m0,157s
    user    0m0,312s
    sys     0m0,089s

You can see the "real" rows show quite good time reduction. Not 8
times reduction, mind you, but that's where serious people dig in and
really speed it up ;-)

[1] https://gitlab.com/snippets/1834609
[2] https://gitlab.com/snippets/1834613

> So, although it would be out of scope for GSoC, checkout, diff and log
> (and maybe others) could all benefit from a thread-safe/parallel pack
> access, right? If so, it is very motivating the impact this project
> could, in theory, have :)

We have to analyze case by case. It may turn out that there are many
opportunities to utilize multiple threads. I think checkout is definitely a
good candidate. For "git diff" and "git log" maybe you can try "perf"
to see how much workload is locked in pack access (mostly inflation,
because it's easier to spot from the profile report)
--
Duy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-03-12 10:02                     ` Duy Nguyen
@ 2019-03-12 10:11                       ` Duy Nguyen
  2019-04-04  1:15                         ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 24+ messages in thread
From: Duy Nguyen @ 2019-03-12 10:11 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Jeff King, Christian Couder, Thomas Gummerer, git,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane

On Tue, Mar 12, 2019 at 5:02 PM Duy Nguyen <pclouds@gmail.com> wrote:
> We have to analyze case by case. It may turn out that there are many

s/are/aren't/

> opportunities to utilize multiple threads. I think checkout is definitely a
> good candidate. For "git diff" and "git log" maybe you can try "perf"
> to see how much workload is locked in pack access (mostly inflation,
> because it's easier to spot from the profile report)
-- 
Duy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-03-12 10:11                       ` Duy Nguyen
@ 2019-04-04  1:15                         ` Matheus Tavares Bernardino
  2019-04-04  7:56                           ` Christian Couder
  0 siblings, 1 reply; 24+ messages in thread
From: Matheus Tavares Bernardino @ 2019-04-04  1:15 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Duy Nguyen, Christian Couder, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, David Kastrup

Hi,

I've been studying the codebase and looking for older emails in the ML
that discussed what I want to propose as my GSoC project. In
particular, I found a thread about slow git commands on chromium, so I
reached out to them on chromium's ML to ask if it's still an issue. I got
the following answer:

On Wed, Apr 3, 2019 at 1:41 PM Erik Chen <erikchen@chromium.org> wrote:
> Yes, this is absolutely still a problem for Chrome. I filed some bugs for common operations that are slow for Chrome: git blame [1], git stash [2], git status [3]
> On Linux, blame is the only operation that is really problematic. On macOS and Windows ... it's hard to find a git operation that isn't slow. :(

I don't really know if threading would help stash and status, but I
think it could help blame. From the little I've read of blame's code so
far, my guess is that the priority queue used for the commits could
serve as the interface for a producer-consumer mechanism, and that way
assign_blame's main loop could be done in parallel. And as we can see
at [4], that is 90% of the command's time. Does this make sense?

But as Duy pointed out, if I recall correctly, for git blame to be
parallel, pack access and diff code would have to be thread-safe
first. And it also seems, from what we've discussed earlier, that this
much wouldn't all fit together in a single GSoC. So, would it be a
nice GSoC proposal to try "making the code used by blame thread-safe",
targeting future parallelism in blame to be done after GSoC? And if
so, could you please point me to which files I should be studying to
write the plan for this proposal? (Unfortunately I wasn't able to
study the pack access and diff code yet; I got carried away looking
for performance hotspots and now I'm a bit behind schedule :(

Also, an implementation of fuzzy blame is being developed right
now[5], and Jeff (CC-ed) recently suggested another performance
improvement that could be made in blame[6]. So I would like to know
whether you think it is worth putting effort into trying to
parallelize it.

Thanks again for all the support.
Matheus Tavares

[1] https://bugs.chromium.org/p/git/issues/detail?id=18&q=
[2] https://bugs.chromium.org/p/git/issues/detail?id=17&q=
[3] https://bugs.chromium.org/p/git/issues/detail?id=16&q=
[4] https://i.imgur.com/XmyJMuE.png
[5] https://public-inbox.org/git/20190403160207.149174-1-brho@google.com/T/#mb772910506e99e00337e4606f8377f95ef41c8c6
[6] https://public-inbox.org/git/87ftqz5osx.fsf@fencepost.gnu.org/T/#m9caa63d35d39de8d2b7c5c6852f331dc2328e1ea

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-04-04  1:15                         ` Matheus Tavares Bernardino
@ 2019-04-04  7:56                           ` Christian Couder
  2019-04-04  8:20                             ` Mike Hommey
  2019-04-05 16:28                             ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 24+ messages in thread
From: Christian Couder @ 2019-04-04  7:56 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Jeff King, Duy Nguyen, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, David Kastrup

Hi,

On Thu, Apr 4, 2019 at 3:15 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> I've been studying the codebase and looking for older emails in the ML
> that discussed what I want to propose as my GSoC project. In
> particular, I found a thread about slow git commands on chromium, so I
> reached out to them on chromium's ML to ask if it's still an issue. I got
> the following answer:
>
> On Wed, Apr 3, 2019 at 1:41 PM Erik Chen <erikchen@chromium.org> wrote:
> > Yes, this is absolutely still a problem for Chrome. I filed some bugs for common operations that are slow for Chrome: git blame [1], git stash [2], git status [3]
> > On Linux, blame is the only operation that is really problematic. On macOS and Windows ... it's hard to find a git operation that isn't slow. :(

Nice investigation. About git status I wonder though if they have
tried the possible optimizations, like untracked cache or
core.fsmonitor.

> I don't really know if threading would help stash and status, but I
> think it could help blame. From the little I've read of blame's code so
> far, my guess is that the priority queue used for the commits could
> serve as the interface for a producer-consumer mechanism, and that way
> assign_blame's main loop could be done in parallel. And as we can see
> at [4], that is 90% of the command's time. Does this make sense?

I can't really tell as I haven't studied this, but from the links in
your email I think it kind of makes sense.

Instead of doing assign_blame()'s main loop in parallel, though, if my
focus were only on making git blame faster, I think I would first try to
cache xdl_hash_record() results and then, if possible, to compute
xdl_hash_record() in parallel, as it seems to be a big bottleneck and
quite low-hanging fruit.

> But as Duy pointed out, if I recall correctly, for git blame to be
> parallel, pack access and diff code would have to be thread-safe
> first. And also, it seems, by what we've talked earlier, that this
> much wouldn't fit all together in a single GSoC. So, would it be a
> nice GSoC proposal to try "making code used by blame thread-safe",
> targeting a future parallelism on blame to be done after GSoC?

Yeah, I think it would be a nice proposal, even though it doesn't seem
to be the most straightforward way to make git blame faster.

Back in 2008 when we proposed a GSoC about creating a sequencer, it
wasn't something that would easily fit in a GSoC, and in fact it
didn't, but over the long run it has been very fruitful as the
sequencer is now used by cherry-pick and rebase -i, and there are
plans to use it even more. So unless people think it's not a good idea
for some reason, which hasn't been the case yet, I am ok with a GSoC
project like this.

> And if
> so, could you please point me out which files should I be studying to
> write the planning for this proposal? (Unfortunately I wasn't able to
> study pack access and diff code yet. I got carried on looking for
> performance hostposts and now I'm a bit behind schedule :(

I don't think you need to study everything yet, and I think you
already did a lot of studying, so I would suggest you first try to
send soon a proposal with the information you have right now, and then
depending on the feedback you get and the time left (likely not
much!!!), you might study some parts of the code a bit more later.

> Also, an implementation of fuzzy blame is being developed right
> now[5], and Jeff (CC-ed) recently suggested another performance
> improvement that could be made in blame[6]. So I would like to know
> whether you think it is worth putting effort into trying to
> parallelize it.

What you would do seems compatible to me with the fuzzy blame effort
and an effort to cache xdl_hash_record() results.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Questions on GSoC 2019 Ideas
  2019-04-04  7:56                           ` Christian Couder
@ 2019-04-04  8:20                             ` Mike Hommey
  2019-04-05 16:28                             ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 24+ messages in thread
From: Mike Hommey @ 2019-04-04  8:20 UTC (permalink / raw)
  To: Christian Couder
  Cc: Matheus Tavares Bernardino, git, Jeff King, Duy Nguyen,
	Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, David Kastrup

On Thu, Apr 04, 2019 at 09:56:35AM +0200, Christian Couder wrote:
> Hi,
> 
> On Thu, Apr 4, 2019 at 3:15 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I've been studying the codebase and looking for older emails in the ML
> > that discussed what I want to propose as my GSoC project. In
> > particular, I found a thread about slow git commands on chromium, so I
> > reached out to them on chromium's ML to ask if it's still an issue. I
> > got the following answer:
> >
> > On Wed, Apr 3, 2019 at 1:41 PM Erik Chen <erikchen@chromium.org> wrote:
> > > Yes, this is absolutely still a problem for Chrome. I filed some bugs for common operations that are slow for Chrome: git blame [1], git stash [2], git status [3]
> > > On Linux, blame is the only operation that is really problematic. On macOS and Windows ... it's hard to find a git operation that isn't slow. :(
> 
Nice investigation. About git status, though, I wonder if they have
tried the possible optimizations, like the untracked cache or
core.fsmonitor.
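
For reference, those optimizations are plain config knobs; a sketch of
a repository config enabling them (in 2019-era git, core.fsmonitor
takes the path of a hook script such as the sample watchman hook, so
the path here is only illustrative):

```ini
[core]
	untrackedCache = true
	fsmonitor = .git/hooks/fsmonitor-watchman
```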

Note that one known issue, specifically on macOS with large repos, is
that the number of files overflows the kernel's inode cache, and lstat
becomes slow.

e.g. https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/p9U4jqz8kxo

Mike


* Re: Questions on GSoC 2019 Ideas
  2019-04-04  7:56                           ` Christian Couder
  2019-04-04  8:20                             ` Mike Hommey
@ 2019-04-05 16:28                             ` Matheus Tavares Bernardino
  2019-04-07 23:40                               ` Christian Couder
  1 sibling, 1 reply; 24+ messages in thread
From: Matheus Tavares Bernardino @ 2019-04-05 16:28 UTC (permalink / raw)
  To: Christian Couder
  Cc: git, Jeff King, Duy Nguyen, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, David Kastrup

On Thu, Apr 4, 2019 at 4:56 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 4, 2019 at 3:15 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I've been studying the codebase and looking for older emails in the ML
> > that discussed what I want to propose as my GSoC project. In
> > particular, I found a thread about slow git commands on chromium, so I
> > reached out to them on chromium's ML to ask if it's still an issue. I
> > got the following answer:
> >
> > On Wed, Apr 3, 2019 at 1:41 PM Erik Chen <erikchen@chromium.org> wrote:
> > > Yes, this is absolutely still a problem for Chrome. I filed some bugs for common operations that are slow for Chrome: git blame [1], git stash [2], git status [3]
> > > On Linux, blame is the only operation that is really problematic. On macOS and Windows ... it's hard to find a git operation that isn't slow. :(
>
> Nice investigation. About git status, though, I wonder if they have
> tried the possible optimizations, like the untracked cache or
> core.fsmonitor.

I don't know if they did, but I suggested they check
core.commitGraph, pack.useBitmaps and core.untrackedCache (which Duy
suggested to me in another thread).

> > I don't really know if threading would help stash and status, but I
> > think it could help blame. From the little I've read of blame's code
> > so far, my guess is that the priority queue used for the commits could
> > serve as the interface for a producer-consumer mechanism and, that
> > way, assign_blame()'s main loop could be run in parallel. And as we
> > can see at [4], that is 90% of the command's time. Does this make sense?
>
> I can't really tell as I haven't studied this, but from the links in
> your email I think it kind of makes sense.
>
> Instead of doing assign_blame()'s main loop in parallel though, if my
> focus was only making git blame faster, I think I would first try to
> cache xdl_hash_record() results and then if possible to compute
> xdl_hash_record() in parallel as it seems to be a big bottleneck and a
> quite low hanging fruit.

Hm, I see. But although it would take more effort to add threading in
assign_blame(), wouldn't it be better, since more work could be done
in parallel? I think it could be implemented in the same fashion as
git grep.

> > But as Duy pointed out, if I recall correctly, for git blame to be
> > parallel, pack access and diff code would have to be thread-safe
> > first. And it also seems, from what we've discussed earlier, that all
> > of this wouldn't fit together in a single GSoC. So, would it be a
> > nice GSoC proposal to try "making the code used by blame thread-safe",
> > laying the groundwork for parallelizing blame after GSoC?
>
> Yeah, I think it would be a nice proposal, even though it doesn't seem
> to be the most straightforward way to make git blame faster.
>
> Back in 2008 when we proposed a GSoC about creating a sequencer, it
> wasn't something that would easily fit in a GSoC, and in fact it
> didn't, but over the long run it has been very fruitful as the
> sequencer is now used by cherry-pick and rebase -i, and there are
> plans to use it even more. So unless people think it's not a good idea
> for some reason, which hasn't been the case yet, I am ok with a GSoC
> project like this.
>
> > And if
> > so, could you please point me to which files I should be studying to
> > write the plan for this proposal? (Unfortunately I wasn't able to
> > study the pack access and diff code yet. I got carried away looking
> > for performance hotspots and now I'm a bit behind schedule :(
>
> I don't think you need to study everything yet, and I think you
> already did a lot of studying, so I would suggest you first try to
> send soon a proposal with the information you have right now, and then
> depending on the feedback you get and the time left (likely not
> much!!!), you might study some parts of the code a bit more later.

Thanks a lot, Christian. I'm writing my proposal and will try to send it today.

> > Also, an implementation of fuzzy blame is being developed right
> > now[5], and Jeff (CC-ed) recently suggested another performance
> > improvement that could be made in blame[6]. So I would like to know
> > whether you think it is worth putting effort into trying to
> > parallelize it.
>
> What you would do seems compatible to me with the fuzzy blame effort
> and an effort to cache xdl_hash_record() results.
>
> Thanks,
> Christian.


* Re: Questions on GSoC 2019 Ideas
  2019-04-05 16:28                             ` Matheus Tavares Bernardino
@ 2019-04-07 23:40                               ` Christian Couder
  0 siblings, 0 replies; 24+ messages in thread
From: Christian Couder @ 2019-04-07 23:40 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Jeff King, Duy Nguyen, Thomas Gummerer,
	Оля Тележная,
	Elijah Newren, Tanushree Tumane, David Kastrup

On Fri, Apr 5, 2019 at 6:29 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, Apr 4, 2019 at 4:56 AM Christian Couder
> <christian.couder@gmail.com> wrote:
> >
> > Nice investigation. About git status, though, I wonder if they have
> > tried the possible optimizations, like the untracked cache or
> > core.fsmonitor.
>
> I don't know if they did, but I suggested they check
> core.commitGraph, pack.useBitmaps and core.untrackedCache (which Duy
> suggested to me in another thread).

Thanks! It would be nice to know if it has improved things for them.

> > I can't really tell as I haven't studied this, but from the links in
> > your email I think it kind of makes sense.
> >
> > Instead of doing assign_blame()'s main loop in parallel though, if my
> > focus was only making git blame faster, I think I would first try to
> > cache xdl_hash_record() results and then if possible to compute
> > xdl_hash_record() in parallel as it seems to be a big bottleneck and a
> > quite low hanging fruit.
>
> Hm, I see. But although it would take more effort to add threading in
> assign_blame(), wouldn't it be better, since more work could be done
> in parallel? I think it could be implemented in the same fashion as
> git grep.

Yeah, I agree that it seems to be worth the effort.

> > I don't think you need to study everything yet, and I think you
> > already did a lot of studying, so I would suggest you first try to
> > send soon a proposal with the information you have right now, and then
> > depending on the feedback you get and the time left (likely not
> > much!!!), you might study some parts of the code a bit more later.
>
> Thanks a lot, Christian. I'm writing my proposal and will try to send it today.

Great!


end of thread, other threads:[~2019-04-07 23:43 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-28 21:46 Questions on GSoC 2019 Ideas Matheus Tavares Bernardino
2019-02-28 22:07 ` Christian Couder
2019-03-01  9:30   ` Duy Nguyen
2019-03-02 15:09     ` Thomas Gummerer
2019-03-03  7:18       ` Christian Couder
2019-03-03 10:12         ` Duy Nguyen
2019-03-03 10:17           ` Duy Nguyen
2019-03-05  4:51           ` Jeff King
2019-03-05 12:57             ` Duy Nguyen
2019-03-05 23:46               ` Matheus Tavares Bernardino
2019-03-06 10:17                 ` Duy Nguyen
2019-03-12  0:18                   ` Matheus Tavares Bernardino
2019-03-12 10:02                     ` Duy Nguyen
2019-03-12 10:11                       ` Duy Nguyen
2019-04-04  1:15                         ` Matheus Tavares Bernardino
2019-04-04  7:56                           ` Christian Couder
2019-04-04  8:20                             ` Mike Hommey
2019-04-05 16:28                             ` Matheus Tavares Bernardino
2019-04-07 23:40                               ` Christian Couder
2019-03-05 23:03         ` Matheus Tavares Bernardino
2019-03-06 23:17           ` Thomas Gummerer
2019-03-03 10:03       ` Duy Nguyen
2019-03-03 16:12         ` Thomas Gummerer
2019-03-01 15:20   ` Johannes Schindelin

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).