git@vger.kernel.org mailing list mirror (one of many)
* Parallelism for submodule update
@ 2023-01-02 16:44 Zitzmann, Christian
  2023-01-02 16:54 ` rsbecker
  2023-01-19 21:39 ` Calvin Wan
  0 siblings, 2 replies; 4+ messages in thread
From: Zitzmann, Christian @ 2023-01-02 16:44 UTC (permalink / raw)
  To: git@vger.kernel.org

Hello,
we have been using Git for many years and make heavy use of submodules.

When updating submodules, only the fetching part is done in parallel (with the config submodule.fetchJobs or --jobs); the checkout is done sequentially.
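
For example, the fetch parallelism can be configured with
- git config submodule.fetchJobs 8
or per invocation with
- git clone --recurse-submodules --jobs 8 <URL>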

What I’ve noticed when cloning with
- scalar clone --full-clone --recurse-submodules <URL>
or
- git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>

is that we lose performance, because the fetch of the blobs happens in the sequential checkout part instead of in the parallel part.

Furthermore, even without partial clone, the utilization of network and hard disk is not always good: first the network is utilized (fetch) and then the hard disk (checkout).

As the checkout part is local to each submodule (no shared resources to block), it would be great if we could move the checkout into the parallelized part, e.g. by doing fetch and checkout (including the blob fetching) in one step via run_processes_parallel_tr2.
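
To illustrate the shape of what I mean, here is a rough sketch against the internal run-command.h API (callback signatures as of the run_processes_parallel_tr2 interface; the one-shot "update-one" child command is made up for illustration and does not exist today):

#include "cache.h"
#include "run-command.h"
#include "strvec.h"

struct update_state {
	const char **paths;	/* submodule paths to update */
	int count, next;
};

/* Called repeatedly by the parallel machinery; returning 0 means "no more tasks". */
static int next_submodule_task(struct child_process *cp, struct strbuf *err,
			       void *cb, void **task_cb)
{
	struct update_state *state = cb;

	if (state->next >= state->count)
		return 0;

	cp->git_cmd = 1;
	/* Made-up one-shot "fetch + checkout" command for one submodule. */
	strvec_pushl(&cp->args, "submodule--helper", "update-one",
		     state->paths[state->next++], NULL);
	return 1;
}

static void update_submodules_parallel(struct update_state *state, int jobs)
{
	/* NULL failure/finished callbacks select the default handlers. */
	run_processes_parallel_tr2(jobs, next_submodule_task, NULL, NULL,
				   state, "submodule", "parallel/update");
}

The point is just that a single child process per submodule would cover both the network half and the hard-disk half of the work.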

I expect that this significantly improves the performance, especially when using partial clones.

Do you think this is possible? Am I missing anything in my reasoning?

Best regards,

Christian Zitzmann



^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Parallelism for submodule update
  2023-01-02 16:44 Parallelism for submodule update Zitzmann, Christian
@ 2023-01-02 16:54 ` rsbecker
  2023-01-13 10:49   ` Zitzmann, Christian
  2023-01-19 21:39 ` Calvin Wan
  1 sibling, 1 reply; 4+ messages in thread
From: rsbecker @ 2023-01-02 16:54 UTC (permalink / raw)
  To: 'Zitzmann, Christian', git



On January 2, 2023 11:45 AM Christian Zitzmann wrote:
>[...]
>Do you think this is possible? Am I missing anything in my reasoning?

Since this is a platform-specific request, if it happens, it should be a configuration switch that defaults to off. On my platform, the file system itself is fairly fast, but the name service traversals and resolutions (what happens inside the name service) are a performance problem. Doing the checkout/switch in parallel would actually be counter-productive in my case. So I would keep it off, but I get that other platforms could benefit.

Regards,
Randall

--
Brief whoami: NonStop&UNIX developer since approximately
UNIX(421664400)
NonStop(211288444200000000)
-- In real life, I talk too much.




^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: Parallelism for submodule update
  2023-01-02 16:54 ` rsbecker
@ 2023-01-13 10:49   ` Zitzmann, Christian
  0 siblings, 0 replies; 4+ messages in thread
From: Zitzmann, Christian @ 2023-01-13 10:49 UTC (permalink / raw)
  To: rsbecker@nexbridge.com, git@vger.kernel.org

Hello Randall,
yes, I guess it is quite common that the hard disk is much faster than the network services.
With the Scalar strategy (e.g. blobless clones), the checkout phase no longer consists mainly of hard-disk activity; it also fetches sources from the remote, and so consumes a lot of network services.
Especially with parallelism we would gain a lot of performance here, as network and hard disk would be utilized in parallel.
This could even be a general strategy (without blobless clones): do fetch and checkout together, both in a parallel scheme.

Currently it's like this:

Multithreaded (mainly network utilization, only a small amount of data: commits and trees)
	Thread1:   Fetch Submodule1      -> NETWORK
	Thread2:   Fetch Submodule2      -> NETWORK
	...
	Thread<x>: Fetch Submodule<n>    -> NETWORK

Sequential (alternating hard-disk and network utilization)
	Loop1
		Try to check out Submodule1 commit                -> HARD DISK
		Fetch missing objects (e.g. blobs, a lot of data) -> NETWORK
		Check out Submodule1 commit                       -> HARD DISK
	Loop2
		Try to check out Submodule2 commit                -> HARD DISK
		Fetch missing objects (e.g. blobs, a lot of data) -> NETWORK
		Check out Submodule2 commit                       -> HARD DISK
	...
	Loop<n>
		Try to check out Submodule<n> commit              -> HARD DISK
		Fetch missing objects (e.g. blobs, a lot of data) -> NETWORK
		Check out Submodule<n> commit                     -> HARD DISK

Here the network accesses in the sequential part have significant waiting times (e.g. name service) while local resources sit mostly idle.


The proposal is to change it to a fully parallel flow:

Multithreaded (both network and hard disk are utilized all the time)
	Thread1:
		Fetch Submodule1 (blobless)          -> NETWORK
		Try to check out Submodule1 commit   -> HARD DISK
		Fetch missing objects                -> NETWORK
		Check out Submodule1 commit          -> HARD DISK
	Thread2:
		Fetch Submodule2 (blobless)          -> NETWORK
		Try to check out Submodule2 commit   -> HARD DISK
		Fetch missing objects                -> NETWORK
		Check out Submodule2 commit          -> HARD DISK
	...
	Thread<x>

The only negative effect I'd expect is with very slow hard disks, or disks that suffer significantly from parallel access; there the overall performance could also suffer.

In general, with partial clones but even with the full-clone approach, network and hard-disk utilization would overlap, and therefore performance can increase.


Best regards

Christian

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Parallelism for submodule update
  2023-01-02 16:44 Parallelism for submodule update Zitzmann, Christian
  2023-01-02 16:54 ` rsbecker
@ 2023-01-19 21:39 ` Calvin Wan
  1 sibling, 0 replies; 4+ messages in thread
From: Calvin Wan @ 2023-01-19 21:39 UTC (permalink / raw)
  To: Christian.Zitzmann; +Cc: Calvin Wan, git@vger.kernel.org

Hi Christian,

I investigated this as well about 2 months ago and am happy to share my
findings with you :)

> When updating submodules, only the fetching part is done in parallel (with the config submodule.fetchJobs or --jobs); the checkout is done sequentially.

Correct.

> What I’ve noticed when cloning with
> - scalar clone --full-clone --recurse-submodules <URL>
> or
> - git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
> 
> is that we lose performance, because the fetch of the blobs happens in the sequential checkout part instead of in the parallel part.
> 
> Furthermore, even without partial clone, the utilization of network and hard disk is not always good: first the network is utilized (fetch) and then the hard disk (checkout).

Also an astute observation that separating out the parallelization of
fetch and checkout doesn't allow us to fully use our resources.

> As the checkout part is local to each submodule (no shared resources to block), it would be great if we could move the checkout into the parallelized part, e.g. by doing fetch and checkout (including the blob fetching) in one step via run_processes_parallel_tr2.
> 
> I expect that this significantly improves the performance, especially when using partial clones.
> 
> Do you think this is possible? Am I missing anything in my reasoning?

Sort of. The issue with run_processes_parallel_tr2 is that it creates a
subprocess with a git command. There is no git command we can call
that does both the correct fetch and the checkout, so first you would
have to create a new option/command for that (and what happens if we
want to add to that parallelization in the future? Create another
option/command?). I think we can do better than that!

`git submodule update`, when called from clone, essentially does 4
things to the submodule: init, clone, checkout, and recursively calls
itself for child submodules. One idea I had was to separate out the
individual tasks that `git submodule update` does and create a new
submodule--helper command (e.g. git submodule--helper update-helper) that
calls those individual tasks. Then, clone would directly call
run_processes_parallel_tr2 with the new submodule--helper command and
each process separated by submodule.

This is the general idea of what I imagine
`git clone --recurse-submodules` would look like:
superproject cloning
run_processes_parallel_tr2(git submodule--helper update-helper)
        Init
        Clone
        Checkout
        Recursive git submodule--helper update-helper
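
To sketch what that hypothetical update-helper could do per submodule (every helper function named below is an illustrative name, not existing API; they stand for the four tasks update currently performs):

/* Hypothetical builtin: git submodule--helper update-helper <path> */
static int update_helper(const char *path)
{
	if (init_submodule_config(path) < 0)		/* init */
		return -1;
	if (clone_submodule_repo(path) < 0)		/* clone (or fetch) */
		return -1;
	if (checkout_submodule_commit(path) < 0)	/* checkout */
		return -1;
	/* recurse into nested submodules */
	return update_nested_submodules(path);
}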

I'll discuss what I think are the benefits of this approach:
- The entirety of submodule update would be parallelized so network and
  hard disk resources can be used together
- There only needs to be one config option that controls how many
  parallel processes to spawn
- Any new features to submodule update are automatically parallelized

The drawback is that any new feature that would cause a race condition
if run in parallel would have to have additional locking code written
for it since separating it out would be difficult. In this case, only
adding lines to .gitmodules in init is at risk of a race condition, but
fortunately that can be handled first in series before running
everything else in parallel.
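
In code terms the ordering could look roughly like this, reusing a get_next_task callback like the one sketched earlier in the thread (names are again illustrative only):

static void update_all_submodules(const char **paths, int count, int jobs)
{
	struct update_state state = { paths, count, 0 };
	int i;

	/*
	 * init writes shared files (e.g. .gitmodules entries), so run
	 * it once, serially, before spawning the parallel workers.
	 */
	for (i = 0; i < count; i++)
		init_submodule_config(paths[i]);	/* illustrative */

	/* clone/checkout/recurse run per submodule, in parallel */
	run_processes_parallel_tr2(jobs, next_submodule_task, NULL, NULL,
				   &state, "submodule", "parallel/update");
}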

I haven't started implementing this and am not planning to fix it in
the near future. This is because we are planning a more long-term
solution (2y+) to problems like this (notice how much simpler it
would've been to add parallelization if we didn't have to create
subprocesses for every separate git command and could instead call
a variety of library functions). So if you need the parallelization
sooner or want to scratch your itch, you're more than welcome to
implement it. Happy to bounce ideas off of and to review any patches
for this!

Thanks,
Calvin

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-01-19 22:01 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-02 16:44 Parallelism for submodule update Zitzmann, Christian
2023-01-02 16:54 ` rsbecker
2023-01-13 10:49   ` Zitzmann, Christian
2023-01-19 21:39 ` Calvin Wan

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).