* Parallelism for submodule update
@ 2023-01-02 16:44 Zitzmann, Christian

From: Zitzmann, Christian @ 2023-01-02 16:44 UTC (permalink / raw)
To: git@vger.kernel.org

Hello,

we have been using Git for many years, making heavy use of submodules. When updating submodules, only the fetching part runs in parallel (with the config submodule.fetchJobs or --jobs); the checkout is done sequentially.

What I have noticed when cloning with
- scalar clone --full-clone --recurse-submodules <URL> or
- git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
is that we lose performance, because the fetch of the blobs happens in the sequential checkout part instead of in the parallel part.

Furthermore, even without partial clone, the utilization of network and hard disk is not always good: first the network is used (fetch) and then the hard disk (checkout).

As the checkout part is local to each submodule (no shared resources to block), it would be great if we could move the checkout into the parallelized part, e.g. by doing fetch and checkout (with blob fetching) in one step with something like run_processes_parallel_tr2. I expect that this would significantly improve performance, especially when using partial clones.

Do you think this is possible? Am I missing anything in my thoughts?

Best regards,
Christian Zitzmann
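[For readers who want to observe the current behavior locally, here is a minimal sketch of a reproduction fixture. All repository names and paths are made up for illustration; `protocol.file.allow=always` is assumed to be needed for file-protocol submodules in recent Git. The recursive clone parallelizes the submodule fetches via --jobs, while the checkouts still run one after another.]

```shell
#!/bin/sh
# Minimal local fixture (hypothetical names) to exercise the behavior
# described above: submodule fetches honor --jobs, checkouts run serially.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Two tiny submodule repos plus a superproject that references them.
for name in sub1 sub2; do
    git init -q "$name"
    echo "$name" >"$name/file.txt"
    git -C "$name" add file.txt
    git -C "$name" -c user.email=you@example.com -c user.name=you \
        commit -q -m "init $name"
done
git init -q super
(
    cd super
    git -c user.email=you@example.com -c user.name=you \
        commit -q --allow-empty -m start
    # protocol.file.allow is required for local-path submodules in recent git
    git -c protocol.file.allow=always submodule --quiet add "$tmp/sub1" sub1
    git -c protocol.file.allow=always submodule --quiet add "$tmp/sub2" sub2
    git -c user.email=you@example.com -c user.name=you \
        commit -q -m "add submodules"
)

# Fetches of sub1/sub2 may run in parallel (--jobs 2); checkout is serial.
git -c protocol.file.allow=always clone -q --recurse-submodules --jobs 2 \
    "$tmp/super" clone1
test -f clone1/sub1/file.txt && echo "submodules checked out"
```

Running the clone with GIT_TRACE2_PERF=1 set should make the phase split visible in the trace output.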
* RE: Parallelism for submodule update
@ 2023-01-02 16:54 rsbecker

From: rsbecker @ 2023-01-02 16:54 UTC (permalink / raw)
To: 'Zitzmann, Christian', git

>-----Original Message-----
>From: <Christian.Zitzmann@vitesco.com> On January 2, 2023 11:45 AM Christian Zitzmann wrote:
>
>we have been using Git for many years, making heavy use of submodules.
>
>When updating submodules, only the fetching part runs in parallel (with
>the config submodule.fetchJobs or --jobs); the checkout is done
>sequentially.
>
>What I have noticed when cloning with
>- scalar clone --full-clone --recurse-submodules <URL> or
>- git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
>is that we lose performance, because the fetch of the blobs happens in
>the sequential checkout part instead of in the parallel part.
>
>Furthermore, even without partial clone, the utilization of network and
>hard disk is not always good: first the network is used (fetch) and then
>the hard disk (checkout).
>
>As the checkout part is local to each submodule (no shared resources to
>block), it would be great if we could move the checkout into the
>parallelized part, e.g. by doing fetch and checkout (with blob fetching)
>in one step with something like run_processes_parallel_tr2.
>
>I expect that this would significantly improve performance, especially
>when using partial clones.
>
>Do you think this is possible? Am I missing anything in my thoughts?

Since this is a platform-specific request, if it happens, this should be a configuration switch that defaults to off. On my platform, the file system itself is fairly fast, but the name-service traversals and resolutions (what happens in the name service) are a performance problem. Doing the checkout/switch in parallel would actually be counter-productive in my case.
So I would keep it off, but I understand that other platforms could benefit.

Regards,
Randall

--
Brief whoami: NonStop&UNIX developer since approximately UNIX(421664400) NonStop(211288444200000000)
-- In real life, I talk too much.
* RE: Parallelism for submodule update
@ 2023-01-13 10:49 Zitzmann, Christian

From: Zitzmann, Christian @ 2023-01-13 10:49 UTC (permalink / raw)
To: rsbecker@nexbridge.com, git@vger.kernel.org

Hello Randall,

yes, I guess it is quite common that the hard disk is much faster than the network services. With the scalar strategy (e.g. blobless clones), however, the checkout phase no longer consists mainly of hard-disk activity: it also fetches sources from the remote, so it consumes a lot of network services. Especially with parallelism we would gain a lot of performance here, as network and hard disk would be utilized in parallel. This could even be a general strategy (without using blobless clones): do fetch and checkout together, but both in a parallel scheme.

Currently it works like this:

Multithreading (mainly network utilization, only a small amount of data: commits and trees)
  Thread1: Fetch submodule1                    -> NETWORK
  Thread2: Fetch submodule2                    -> NETWORK
  ...
  Thread<x>: Fetch submodule<n>                -> NETWORK

Sequential (alternating hard-disk and network utilization)
  Loop1:
    Try to check out submodule1 commit         -> HARD DISK
    Fetch missing objects (e.g. blobs, a large amount of data) -> NETWORK
    Check out submodule1 commit                -> HARD DISK
  Loop2:
    Try to check out submodule2 commit         -> HARD DISK
    Fetch missing objects (e.g. blobs, a large amount of data) -> NETWORK
    Check out submodule2 commit                -> HARD DISK
  ...
  Loop<n>:
    Try to check out submodule<n> commit       -> HARD DISK
    Fetch missing objects (e.g. blobs, a large amount of data) -> NETWORK
    Check out submodule<n> commit              -> HARD DISK

Here the network accesses in the sequential part have really significant waiting times (e.g. name service) while local resources sit mostly idle.

The proposal is to change this to a fully parallel flow:
Multithreading (both network and hard disk are utilized all the time)
  Thread1:
    Fetch submodule1 (blobless)                -> NETWORK
    Try to check out submodule1 commit         -> HARD DISK
    Fetch missing objects                      -> NETWORK
    Check out submodule1 commit                -> HARD DISK
  Thread2:
    Fetch submodule2 (blobless)                -> NETWORK
    Try to check out submodule2 commit         -> HARD DISK
    Fetch missing objects                      -> NETWORK
    Check out submodule2 commit                -> HARD DISK
  ...
  Thread<x>

The only negative effect I can see is with very slow hard disks, or disks that suffer significantly from parallel access; there the overall performance could also suffer. In general, with the partial-clone approach but even with full clones, network and hard-disk utilization will overlap, and therefore performance can increase.

Best regards
Christian

-----Original Message-----
From: rsbecker@nexbridge.com <rsbecker@nexbridge.com>
Sent: Monday, January 2, 2023 17:54
To: Zitzmann, Christian <Christian.Zitzmann@vitesco.com>; git@vger.kernel.org
Subject: RE: Parallelism for submodule update

>-----Original Message-----
>From: <Christian.Zitzmann@vitesco.com> On January 2, 2023 11:45 AM Christian Zitzmann wrote:
>
>we have been using Git for many years, making heavy use of submodules.
>
>When updating submodules, only the fetching part runs in parallel (with
>the config submodule.fetchJobs or --jobs); the checkout is done
>sequentially.
>
>What I have noticed when cloning with
>- scalar clone --full-clone --recurse-submodules <URL> or
>- git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
>is that we lose performance, because the fetch of the blobs happens in
>the sequential checkout part instead of in the parallel part.
>
>Furthermore, even without partial clone, the utilization of network and
>hard disk is not always good: first the network is used (fetch) and then
>the hard disk (checkout).
>
>As the checkout part is local to each submodule (no shared resources to
>block), it would be great if we could move the checkout into the
>parallelized part, e.g. by doing fetch and checkout (with blob fetching)
>in one step with something like run_processes_parallel_tr2.
>
>I expect that this would significantly improve performance, especially
>when using partial clones.
>
>Do you think this is possible? Am I missing anything in my thoughts?

Since this is a platform-specific request, if it happens, this should be a configuration switch that defaults to off. On my platform, the file system itself is fairly fast, but the name-service traversals and resolutions are a performance problem. Doing the checkout/switch in parallel would actually be counter-productive in my case. So I would keep it off, but I understand that other platforms could benefit.

Regards,
Randall
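[The two schemes Christian sketches above can be modeled with a toy shell script, where `sleep` stands in for network-bound (fetch) and disk-bound (checkout) work. The step durations are illustrative only, not measurements of real git behavior.]

```shell
#!/bin/sh
# Toy timing model of the two schemes: two submodules, each needing one
# metadata fetch, a checkout attempt, a blob fetch, and a final checkout.
fetch()    { sleep 1; }   # stands in for network-bound work
checkout() { sleep 1; }   # stands in for disk-bound work

# Current scheme: parallel fetch phase, then a sequential per-submodule loop.
start=$(date +%s)
fetch & fetch & wait        # fetch submodule1 and submodule2 in parallel
checkout; fetch; checkout   # loop 1: try checkout, fetch blobs, checkout
checkout; fetch; checkout   # loop 2: same for the second submodule
sequential=$(( $(date +%s) - start ))

# Proposed scheme: each submodule runs its whole fetch+checkout pipeline in
# its own parallel job, overlapping network and disk across submodules.
start=$(date +%s)
( fetch; checkout; fetch; checkout ) &
( fetch; checkout; fetch; checkout ) &
wait
parallel=$(( $(date +%s) - start ))

echo "sequential=${sequential}s parallel=${parallel}s"
```

With one second per step, the current scheme takes roughly seven seconds while the fully parallel scheme takes roughly four, because the two pipelines overlap.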
* Re: Parallelism for submodule update
@ 2023-01-19 21:39 Calvin Wan

From: Calvin Wan @ 2023-01-19 21:39 UTC (permalink / raw)
To: Christian.Zitzmann; +Cc: Calvin Wan, git@vger.kernel.org

Hi Christian,

I investigated this as well about 2 months ago and am happy to share my findings with you :)

> When updating submodules, only the fetching part runs in parallel (with the config submodule.fetchJobs or --jobs); the checkout is done sequentially.

Correct.

> What I have noticed when cloning with
> - scalar clone --full-clone --recurse-submodules <URL> or
> - git clone --filter=blob:none --also-filter-submodules --recurse-submodules <URL>
> is that we lose performance, because the fetch of the blobs happens in the sequential checkout part instead of in the parallel part.
>
> Furthermore, even without partial clone, the utilization of network and hard disk is not always good: first the network is used (fetch) and then the hard disk (checkout).

Also an astute observation: separating the parallelization of fetch and checkout doesn't allow us to fully use our resources.

> As the checkout part is local to each submodule (no shared resources to block), it would be great if we could move the checkout into the parallelized part, e.g. by doing fetch and checkout (with blob fetching) in one step with something like run_processes_parallel_tr2.
>
> I expect that this would significantly improve performance, especially when using partial clones.
>
> Do you think this is possible? Am I missing anything in my thoughts?

Sort of. The issue with run_processes_parallel_tr2 is that it creates a subprocess with a git command.
There is no git command we can call that does both the correct fetch and checkout, so first you would have to create a new option/command for that (and what happens if we want to add to that parallelization in the future? Create another option/command?). I think we can do better than that!

`git submodule update`, when called from clone, essentially does four things per submodule: init, clone, checkout, and recursively calling itself for child submodules. One idea I had was to separate out the individual tasks that `git submodule update` does and create a new submodule--helper command (e.g. git submodule--helper update-helper) that calls those individual tasks. Then, clone would directly call run_processes_parallel_tr2 with the new submodule--helper command, with one process per submodule. This is the general idea of what `git clone --recurse-submodules` would look like:

superproject cloning
  run_processes_parallel_tr2(git submodule--helper update-helper)
    Init
    Clone
    Checkout
    Recursive git submodule--helper update-helper

The benefits of this approach:

- The entirety of submodule update would be parallelized, so network and hard-disk resources can be used together
- Only one config option is needed to control how many parallel processes to spawn
- Any new features added to submodule update are automatically parallelized

The drawback is that any new feature that would cause a race condition if run in parallel would need additional locking code written for it, since separating it out would be difficult. In this case, only adding lines to .gitmodules in init is at risk of a race condition, but fortunately that can be handled first in series before running everything else in parallel.

I haven't started implementing this and am not planning to fix this in the near future.
This is because we are planning a more long-term solution (2y+) to problems like this (notice how much simpler it would have been to add parallelization if we did not have to create subprocesses for every separate git command and could instead call from a variety of library functions). So if you need the parallelization sooner, or want to scratch your itch, you are more than welcome to implement it. Happy to bounce ideas around and review any patches for this!

Thanks,
Calvin
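[Calvin's two-phase idea, serial init first and then one parallel worker per submodule, can be sketched in shell. The `update-helper` tasks below are hypothetical stand-ins: no such git command exists today, and plain shell functions play the roles of the helper's subtasks.]

```shell
#!/bin/sh
# Sketch of the two-phase flow described above. Function names mirror the
# *hypothetical* `git submodule--helper update-helper` tasks; they only
# append markers to a log so the ordering is observable.
set -e
log=$(mktemp)

init()      { echo "init $1" >>"$log"; }   # touches shared .gitmodules/config
clone_sub() { echo "clone $1" >>"$log"; }
checkout()  { echo "checkout $1" >>"$log"; }

update_helper() {     # per-submodule worker; safe to run in parallel
    clone_sub "$1"
    checkout "$1"
    # a real helper would recurse into nested submodules here
}

# Phase 1: init every submodule in series, avoiding the race on shared state
# that the message above points out.
for sub in sub1 sub2 sub3; do init "$sub"; done

# Phase 2: one parallel job per submodule, like run_processes_parallel_tr2.
for sub in sub1 sub2 sub3; do update_helper "$sub" & done
wait
echo "updated $(grep -c checkout "$log") submodules"   # prints: updated 3 submodules
```

The serial first phase is what keeps the design simple: everything that mutates shared files runs before any parallelism starts, so the workers never need locks.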