git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Common thread pool API in Git?
@ 2019-09-26  0:33 Jonathan Tan
  2019-09-26  0:35 ` Jonathan Tan
  2019-09-27 22:14 ` Jeff King
  0 siblings, 2 replies; 3+ messages in thread
From: Jonathan Tan @ 2019-09-26  0:33 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Does anyone have ideas or plans for this? I know that (at least) "grep"
and "index-pack" have their own implementations, and it would be great
to just have one that all code can use.

For those who want to know, this question was motivated by a big delta
tree occurring in [1]. index-pack does parallelize delta resolution, but
it cannot split up trees into threads: each delta base root can go into
its own thread, but when a delta base root is processed, all deltas on
that root (direct or indirect) is processed in the same thread. I took a
look to see if this could be parallelized, but thought that it might be
better to standardize on a thread pool implementation first.

Searching reveals [2], but I don't think it ever made it into Git.

[1] git -c core.deltabasecachelimit=2g clone https://fuchsia.googlesource.com/third_party/vulkan-cts
[2] https://public-inbox.org/git/1440121237-24576-2-git-send-email-sbeller@google.com/

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Common thread pool API in Git?
  2019-09-26  0:33 Common thread pool API in Git? Jonathan Tan
@ 2019-09-26  0:35 ` Jonathan Tan
  2019-09-27 22:14 ` Jeff King
  1 sibling, 0 replies; 3+ messages in thread
From: Jonathan Tan @ 2019-09-26  0:35 UTC (permalink / raw)
  To: jonathantanmy; +Cc: git

> For those who want to know, this question was motivated by a big delta
> tree occurring in [1].

Forgot to mention: this is not an issue of the server not repacking with
a chain length limit. The tree in question is indeed not tall, but it is
very wide.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Common thread pool API in Git?
  2019-09-26  0:33 Common thread pool API in Git? Jonathan Tan
  2019-09-26  0:35 ` Jonathan Tan
@ 2019-09-27 22:14 ` Jeff King
  1 sibling, 0 replies; 3+ messages in thread
From: Jeff King @ 2019-09-27 22:14 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Wed, Sep 25, 2019 at 05:33:00PM -0700, Jonathan Tan wrote:

> Does anyone have ideas or plans for this? I know that (at least) "grep"
> and "index-pack" have their own implementations, and it would be great
> to just have one that all code can use.
> [..]
> Searching reveals [2], but I don't think it ever made it into Git.

I think that's correct. I had a recollection of this being discussed in
the past, but I'm pretty sure I was thinking of the patches you already
linked to. We didn't take it because the proposed callers actually were
starting sub-processes anyway, so we ended up instead with
run_processes_parallel():

  https://public-inbox.org/git/1442012994-20374-1-git-send-email-sbeller@google.com/

In case it's useful, here's the latest version of the threaded version
that was posted to the list (I think):

  https://public-inbox.org/git/1440724495-708-5-git-send-email-sbeller@google.com/

> For those who want to know, this question was motivated by a big delta
> tree occurring in [1]. index-pack does parallelize delta resolution, but
> it cannot split up trees into threads: each delta base root can go into
> its own thread, but when a delta base root is processed, all deltas on
> that root (direct or indirect) is processed in the same thread. I took a
> look to see if this could be parallelized, but thought that it might be
> better to standardize on a thread pool implementation first.

This seems like a reasonable thing to want to do. It does mean that
threads have more potential to interact with each other. E.g., two might
use the same base now, so we'd have to make sure we don't throw away the
bytes of that base until both are done. As you pointed out to me
off-list, the bases are currently placed in a cache, but I think that
cache is currently per-thread. I guess we could "split" it and copy the
new base into a thread's cache when we give it work to do.

While thinking about this, an unrelated optimization occurred to me
(that I don't see us doing). Right now we put every base we resolve into
the cache. But if it only has one child delta, after we resolve that
delta, we can definitely remove the base (even if we're still resolving
deltas against that child). Or more generally, once we've resolved the
_final_ child that uses that base, we no longer need it. Here's an
illustration (where "<-" means "has delta child"):

  A <- B <- C
         <- D <- E <- F ...
	      <- G <- H ...

We'd visit each column in a depth-first way, looking for children of
each base we find. Once we've resolved C and D, we don't need to care
about the cache of B anymore. So freeing it puts more room in the cache
while processing E, F, etc.

And likewise, we can throw away D after finishing G. The strategy might
help more if we did a breadth-first traversal, but then you have the
opposite problem (instead of a deep tree busting your cache, a wide one
would).

Anyway, that's all pretty far off your original point. But it might be
worth thinking about it you have to touch this part of the code as part
of your threading work.

-Peff

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-09-27 22:14 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-26  0:33 Common thread pool API in Git? Jonathan Tan
2019-09-26  0:35 ` Jonathan Tan
2019-09-27 22:14 ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).