git@vger.kernel.org mailing list mirror (one of many)
From: Konstantin Tokarev <annulen@yandex.ru>
To: Derrick Stolee <stolee@gmail.com>, Jeff King <peff@peff.net>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Inefficiency of partial shallow clone vs shallow clone + "old-style" sparse checkout
Date: Wed, 01 Apr 2020 04:49:20 +0300	[thread overview]
Message-ID: <8268671585700012@iva3-58091f505f14.qloud-c.yandex.net> (raw)
In-Reply-To: <0bf763ad-5f1d-65e2-bf3a-a4b7d5a7b3e3@gmail.com>



01.04.2020, 03:09, "Derrick Stolee" <stolee@gmail.com>:
> On 3/31/2020 6:23 PM, Konstantin Tokarev wrote:
>>  01.04.2020, 01:10, "Konstantin Tokarev" <annulen@yandex.ru>:
>>>  28.03.2020, 19:58, "Derrick Stolee" <stolee@gmail.com>:
>>>>   On 3/28/2020 10:40 AM, Jeff King wrote:
>>>>>    On Sat, Mar 28, 2020 at 12:08:17AM +0300, Konstantin Tokarev wrote:
>>>>>
>>>>>>    Is it a known thing that addition of --filter=blob:none to workflow
>>>>>>    with shallow clone (e.g. --depth=1) and a following sparse checkout may
>>>>>>    significantly slow down process and result in much larger .git
>>>>>>    repository?
>>>>
>>>>   In general, I would recommend not using shallow clones in conjunction
>>>>   with partial clone. The blob:none filter will get you what you really
>>>>   want from shallow clone without any of the downsides of shallow clone.
>>>
>>>  Is it really so?
>>>
>>>  As you can see from my measurements [1], in my case simple shallow clone (1)
>>>  runs faster than simple partial clone (2) and produces slightly smaller .git,
>>>  from which I can infer that (2) downloads some data which is not downloaded
>>>  in (1).
>>
>>  Actually, as I have full git logs for all these cases, there is no need to guess:
>>      (1) downloads 295085 git objects of total size 1.00 GiB
>>      (2) downloads 1949129 git objects of total size 1.01 GiB
>
> It is worth pointing out that these sizes are very close. The number of objects
> may be part of why the timing is so different as the client needs to parse all
> deltas to verify the object contents.
>
> Re-running the test with GIT_TRACE2_PERF=1 might reveal some interesting info
> about which regions are slower than others.
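
For anyone wanting to reproduce the trace capture, here is a minimal sketch. A
throwaway local repository stands in for the real remote so the commands run
anywhere; in the real case you would point the clone at the remote URL instead:

```shell
# Create a tiny stand-in repository; the real case would use a remote URL.
git init -q source
git -C source -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
git -C source config uploadpack.allowFilter true

# GIT_TRACE2_PERF writes one timestamped event per line to the given file.
GIT_TRACE2_PERF="$PWD/clone.perf" \
    git clone -q --filter=blob:none "file://$PWD/source" copy

# Region and child-process events show where the time goes:
grep -E 'region_enter|child_start' clone.perf | head
```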

Here are trace results for (1) with fix discussed below:
https://gist.github.com/annulen/58b868e35e992105e7028946a8370795

Here are trace results for (2) with fix discussed below:
https://gist.github.com/annulen/fa1ef1b5d1056e6dede815e9ebf85c03

>
>>  Total sizes are very close, but (2) downloads much more objects, and also it uses
>>  3 passes to download them which leads to less efficient use of network bandwidth.
>
> Three passes, being:
>
> 1. Download commits and trees.
> 2. Initialize sparse-checkout with blobs at root.
> 3. Expand sparse-checkout.
>
> Is that right? You could group 1 & 2 by setting your sparse-checkout patterns
> before initializing a checkout (if you clone with --no-checkout). Your link
> says you did this:
>
>         git clone <mode> --no-checkout <url> <dir>
>         git sparse-checkout init
>         git sparse-checkout set '/*' '!LayoutTests'
>
> Try doing it this way instead:
>
>         git clone <mode> --no-checkout <url> <dir>
>         git config core.sparseCheckout true
>         git sparse-checkout set '/*' '!LayoutTests'
>
> By doing it this way, you skip the step where the 'init' subcommand looks
> for all blobs at root and does a network call for them. Should remove some
> overhead.

Thanks, that helped. Now git downloads objects in only two passes.

From reading the man page I had assumed that `git sparse-checkout init` does the
same as `git config core.sparseCheckout true`, unless the `--cone` argument
is specified.
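
In case it helps anyone following along, here is the full sequence as I now
understand it, sketched against a throwaway local repository (the real case
would use the remote URL; the excluded LayoutTests path is the example from
this thread). Writing .git/info/sparse-checkout directly is the "old-style"
equivalent of a non-cone `git sparse-checkout set`, and avoids the extra blob
fetch done by `git sparse-checkout init`:

```shell
# Stand-in for the real remote, with a LayoutTests directory to exclude.
git init -q source
mkdir source/LayoutTests
echo layout > source/LayoutTests/a.txt
echo code   > source/main.c
git -C source add -A
git -C source -c user.name=t -c user.email=t@example.com commit -q -m init
git -C source config uploadpack.allowFilter true

git clone -q --filter=blob:none --no-checkout "file://$PWD/source" wk
cd wk
git config core.sparseCheckout true
# Old-style sparse patterns, written directly instead of 'sparse-checkout init':
printf '/*\n!LayoutTests\n' > .git/info/sparse-checkout
# Populate the worktree according to the patterns; with --filter=blob:none,
# only the blobs actually matched by the patterns are downloaded here.
git read-tree -mu HEAD
```

Afterwards main.c is present in the worktree while LayoutTests is not.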

>
> Less efficient use of network bandwidth is one thing, but shallow clones are
> also more CPU-intensive with the "counting objects" phase on the server. Your
> link shares the following end-to-end timings:
>
> * Shallow-clone: 234s
> * Partial clone: 286s
> * Both(???): 1023s
>
> The data implies that by asking for both you actually got a full clone (4.1 GB).

No, this is still a partial clone; a full clone takes more than 6 GB.

>
> The 234s to 286s difference is meaningful. Almost a minute.
>
>>>  To be clear, the use case I'm interested in right now is checking out sources in
>>>  a cloud CI system like GitHub Actions for a one-shot build. Right now checkout usually
>>>  takes 1-2 minutes, and my hope was that someday in the future it would be possible
>>>  to make it faster.
>
> As long as you delete the shallow clone every time, then you also remove the
> downsides of a shallow clone related to a later fetch or attempts to push.
>
> If possible, a repo this size would benefit from persistent build agents that
> you control. They can keep a copy of the repo around and do incremental fetches
> that are much faster. It's a larger investment to run your own build lab, though.
> But sometimes making builds faster is expensive. It depends on how "expensive" those
> four minute clones per build are in terms of your team waiting.

No, current checkout times for shallow clone + sparseCheckout are quite acceptable.
(FWIW, initially I used shallow clone without sparseCheckout, as the latter is not supported
by GitHub Actions out of the box, and those times were NOT acceptable: depending on
server load, checkout could take 16 minutes or even more.)

For now it's just my curiosity and a desire to provide info which could make Git better.
It just seemed logical to me initially that if we limit both the required paths in the worktree
and the history depth at the object-download stage, it should be more efficient than limiting
only the history depth (or, at least, equally efficient).

BTW, I did more measurements, and the results seem to be highly dependent on the server side.
Once, partial clone (2) even ran faster than shallow clone + sparseCheckout (1). Still, most of
the time (1) is faster than (2).
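
For reference, the three configurations being compared boil down to the
following (sketched against a throwaway local repository with two commits, so
--depth=1 actually truncates; the real measurements were of course against the
remote):

```shell
# Stand-in repository; two commits so that --depth=1 has something to cut.
git init -q src
git -C src -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git -C src -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m two
git -C src config uploadpack.allowFilter true

git clone -q --depth=1                    "file://$PWD/src" shallow  # (1) shallow
git clone -q --filter=blob:none           "file://$PWD/src" partial  # (2) partial
git clone -q --depth=1 --filter=blob:none "file://$PWD/src" both     # shallow+partial
```

(1) truncates history, (2) defers blob download, and the combined mode does
both, which is exactly the case whose overhead this thread is about.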

-- 
Regards,
Konstantin

