On 2021-01-22 at 19:53:21, Emily Shaffer wrote:
> On Thu, Jan 21, 2021 at 7:01 PM William Chen wrote:
> > When I try to clone a repo of a large size from github, it is slow.
> >
> > $ git clone https://github.com/git/git
> > ...
> > remote: Enumerating objects: 56, done.
> > remote: Counting objects: 100% (56/56), done.
> > remote: Compressing objects: 100% (25/25), done.
> > Receiving objects: 23% (70386/299751), 33.00 MiB | 450.00 KiB/s
> >
> > The following aria2c command, which can use multiple downloading threads, is much faster. Would you please let me know whether there is a way to speed up git clone (maybe by using parallelization)?
>
> In general, it would be more compelling to see actual numbers than
> "much faster", e.g. the outputs of `time git clone
> https://github.com/git/git` and `time aria2c
> https://github.com/git/git/archive/master.zip` - or even an estimation
> from you, like, "I think clone takes a minute or two but aria does the
> same thing in only a couple of seconds". "Much faster" means something
> different to everyone :)

When Git shows the download speed, I believe it shows the speed over
the current interval, so the figure can vary from moment to moment.
Part of that is server side, since Git performs compression of data,
and part of that is client side. For instance, if you're using SHA-1-DC
(which you should, for security), there is a theoretical performance
limit for clones of about 50 MiB/s on the client side, which would be
improved if you were cloning a SHA-256 repository[0].

If this is a fresh clone, it is probably already cached on the server
side, since frequently requested pack files are cached at GitHub
(although I'm not sure whether it's possible to determine if a given
request was served from cache), so it could be that it's just pushing
data over the pipe as fast as your system can process it.

> > Your help is much appreciated! I look forward to hearing from you. Thanks.
> >
> > $ aria2c https://github.com/git/git/archive/master.zip
> >
> > 01/21 20:16:04 [NOTICE] Downloading 1 item(s)
> >
> > 01/21 20:16:04 [NOTICE] CUID#7 - Redirecting to https://codeload.github.com/git/git/zip/master
>
> Right here it looks like your zip download redirects to a CDN or
> something, which is probably better optimized for serving archives
> than the Git server itself, so I would guess that has something to do
> with it too.

This is indeed backed by a CDN which may be much closer to you
physically. Without seeing the full request, it's hard for me to say
where your request was served from (CDN or not).

I should point out that in this case you're cloning a single revision
(so much less data), the data is usually cached, and your end of the
system is not performing any decompression or hash verification, so it
may appear faster when you're not performing equivalent work. (I would
kindly ask that you not try to download every revision in history for a
comparison, because that would be a clearly excessive and abusive level
of usage.)

I should also point out that you can't use multiple download threads to
download from these endpoints because they don't, in general, handle
Range requests. (Basically, they do if they're already cached, but not
if they're not.)

> > [#59b6a2 8.2MiB/0B CN:1 DL:3.8MiB]
> > 01/21 20:16:08 [NOTICE] Download complete: /private/tmp/git-master.zip
> >
> > Download Results:
> > gid   |stat|avg speed  |path/URI
> > ======+====+===========+=======================================================
> > 59b6a2|OK  |   2.9MiB/s|/private/tmp/git-master.zip
> >
> > Status Legend:
> > (OK):download completed.
>
> There are others on the list who are better able to explain this than
> me. But I'd guess the upshot is that 'git clone
> https://github.com/git/git' is asking a Git server, which is good at
> Git repo management (e.g. accepting pushes, generating packfiles to
> send you a specific object or branch, etc) - but when you ask for
> "git/git/archive/master.zip" you're getting the result of some work
> the Git server already did a while ago to zip up the current 'master'
> into an archive and give it to some other server.

It's impossible for me to say definitively what the performance problem
is in this case, but I don't think it's intrinsically Git if you're
seeing less than a 50 MiB/s speed. Git can and does process data at
that speed on the local system (and at 15 MiB/s on my local network
over Wi-Fi), so I'd guess that it's either a limitation in network
performance based on two different serving locations or perhaps a
temporarily overloaded server combined with the packfile not being
cached.

[0] I personally think 50 MiB/s is a very reasonable transfer speed, but
some people disagree.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US