On 2021-01-22 at 19:53:21, Emily Shaffer wrote:
> On Thu, Jan 21, 2021 at 7:01 PM William Chen <williamchen32335@gmail.com> wrote:
> > When I try to clone a repo of a large size from github, it is slow.
> >
> > $ git clone https://github.com/git/git
> > ...
> > remote: Enumerating objects: 56, done.
> > remote: Counting objects: 100% (56/56), done.
> > remote: Compressing objects: 100% (25/25), done.
> > Receiving objects:  23% (70386/299751), 33.00 MiB | 450.00 KiB/s
> >
> > The following aria2c command, which can use multiple downloading threads, is much faster. Would you please let me know whether there is a way to speed up git clone (maybe by using parallelization)?
> 
> In general, it would be more compelling to see actual numbers than
> "much faster", e.g. the outputs of `time git clone
> https://github.com/git/git` and `time aria2c
> https://github.com/git/git/archive/master.zip` - or even an estimation
> from you, like, "I think clone takes a minute or two but aria does the
> same thing in only a couple of seconds". "Much faster" means something
> different to everyone :)

When Git shows the download speed, I believe it shows the speed over the
most recent interval, so the figure you see at any given moment can
vary.  Part of that is server side, since Git performs compression of
data, and part of that is client side.  For instance, if you're using
SHA-1-DC (which you should, for security), there is a theoretical
performance limit for clones of about 50 MiB/s on the client side which
would be improved if you were cloning a SHA-256 repository[0].

If this is a fresh clone, the pack is probably already cached on the
server side, since frequently requested pack files are cached at GitHub
(although I'm not sure whether it's possible to determine if a given
request was served from cache), so it could be that it's just pushing
data over the pipe as fast as your system can process it.

> > Your help is much appreciated! I look forward to hearing from you. Thanks.
> >
> > $ aria2c https://github.com/git/git/archive/master.zip
> >
> > 01/21 20:16:04 [NOTICE] Downloading 1 item(s)
> >
> > 01/21 20:16:04 [NOTICE] CUID#7 - Redirecting to https://codeload.github.com/git/git/zip/master
> 
> Right here it looks like your zip download redirects to a CDN or
> something, which is probably better optimized for serving archives
> than the Git server itself, so I would guess that has something to do
> with it too.

This is indeed backed by a CDN which may be much closer to you
physically.  Without seeing the full request, it's hard for me to say
where your request was served from (CDN or not).  I should point out
that in this case you're downloading a single revision (so much less data),
the data is usually cached, and your end of the system is not performing
any decompression or hash verification, so it may appear faster when
you're not performing equivalent work.

(I would kindly ask that you not try to download every revision in
history for a comparison, because that would be a clearly excessive and
abusive level of usage.)

I should also point out that you can't use multiple download threads to
download from these endpoints because they don't, in general, handle
Range requests.  (Basically, they do if they're already cached, but not
if they're not.)
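
One way to see whether a given URL honors Range is to ask for a single
byte and look at the status code: 206 means the range was honored, 200
means the server ignored it and sent the whole body.  The following is a
minimal sketch, assuming curl is installed; the function names
range_verdict and check_range are made up for illustration, and the
network call is left commented out so that nothing is fetched by
accident.

```shell
# Interpret the HTTP status code returned for a ranged GET.
range_verdict() {
  case "$1" in
    206) echo "range honored" ;;
    200) echo "range ignored (full body sent)" ;;
    *)   echo "inconclusive (HTTP $1)" ;;
  esac
}

# Request only the first byte, follow redirects, discard the body,
# and classify the final status code.  Note that if the server ignores
# the range, curl will download the entire file before reporting.
check_range() {
  code=$(curl -s -L -r 0-0 -o /dev/null -w '%{http_code}' "$1")
  range_verdict "$code"
}

# Example (performs one request; uncomment to run):
# check_range https://github.com/git/git/archive/master.zip
```

A downloader that splits a file across connections depends on getting
206 responses, so a 200 here means the extra connections buy nothing.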

> > [#59b6a2 8.2MiB/0B CN:1 DL:3.8MiB]
> > 01/21 20:16:08 [NOTICE] Download complete: /private/tmp/git-master.zip
> >
> > Download Results:
> > gid   |stat|avg speed  |path/URI
> > ======+====+===========+=======================================================
> > 59b6a2|OK  |   2.9MiB/s|/private/tmp/git-master.zip
> >
> > Status Legend:
> > (OK):download completed.
> 
> There are others on the list who are better able to explain this than
> me. But I'd guess the upshot is that 'git clone
> https://github.com/git/git' is asking a Git server, which is good at
> Git repo management (e.g. accepting pushes, generating packfiles to
> send you a specific object or branch, etc) - but when you ask for
> "git/git/archive/master.zip" you're getting the result of some work
> the Git server already did a while ago to zip up the current 'master'
> into an archive and give it to some other server.

It's impossible for me to say definitively what the performance problem
is in this case, but I don't think it's intrinsically Git if you're
seeing speeds below 50 MiB/s.  Git can and does process data at
that speed on the local system (and at 15 MiB/s on my local network over
Wi-Fi), so I'd guess that it's either a limitation in network
performance based on two different serving locations or perhaps a
temporarily overloaded server combined with the packfile not being
cached.
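
To put the numbers from the quoted progress line next to the 50 MiB/s
figure, here is a rough back-of-the-envelope sketch.  It assumes the
fraction of objects received is a fair proxy for the fraction of bytes
received, which is only approximately true since objects vary in size,
so treat the result as an order-of-magnitude estimate.  The function
name estimate_clone is hypothetical; the inputs come from the
"Receiving objects" line above.

```shell
# Usage: estimate_clone RECEIVED_OBJS TOTAL_OBJS RECEIVED_MIB RATE_KIB_S [CEILING_MIB_S]
# Extrapolates total pack size and remaining time from one progress line,
# and reports the observed rate as a percentage of a client-side ceiling.
estimate_clone() {
  awk -v r="$1" -v t="$2" -v m="$3" -v k="$4" -v c="${5:-50}" 'BEGIN {
    total = m * t / r                    # estimated total size in MiB
    remaining = (total - m) * 1024 / k   # seconds left at the current rate
    pct = k / (c * 1024) * 100           # observed rate vs. ceiling, in percent
    printf "%.1f MiB total, %.0f s remaining, %.1f%% of %s MiB/s\n",
           total, remaining, pct, c
  }'
}

# Values from: Receiving objects:  23% (70386/299751), 33.00 MiB | 450.00 KiB/s
estimate_clone 70386 299751 33.00 450.00
```

At under 1% of the client-side ceiling, the observed rate points at the
network or the server rather than at Git's own processing.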

[0] I personally think 50 MiB/s is a very reasonable transfer speed, but
some people disagree.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US