git@vger.kernel.org mailing list mirror (one of many)
From: Derrick Stolee <stolee@gmail.com>
To: Konstantin Tokarev <annulen@yandex.ru>, Jeff King <peff@peff.net>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Inefficiency of partial shallow clone vs shallow clone + "old-style" sparse checkout
Date: Tue, 31 Mar 2020 20:09:26 -0400
Message-ID: <0bf763ad-5f1d-65e2-bf3a-a4b7d5a7b3e3@gmail.com>
In-Reply-To: <4872731585693023@vla5-c5051da8689e.qloud-c.yandex.net>

On 3/31/2020 6:23 PM, Konstantin Tokarev wrote:
> 01.04.2020, 01:10, "Konstantin Tokarev" <annulen@yandex.ru>:
>> 28.03.2020, 19:58, "Derrick Stolee" <stolee@gmail.com>:
>>>  On 3/28/2020 10:40 AM, Jeff King wrote:
>>>>   On Sat, Mar 28, 2020 at 12:08:17AM +0300, Konstantin Tokarev wrote:
>>>>
>>>>>   Is it a known thing that addition of --filter=blob:none to workflow
>>>>>   with shallow clone (e.g. --depth=1) and following sparse checkout may
>>>>>   significantly slow down process and result in much larger .git
>>>>>   repository?
>>>
>>>  In general, I would recommend not using shallow clones in conjunction
>>>  with partial clone. The blob:none filter will get you what you really
>>>  want from shallow clone without any of the downsides of shallow clone.
>>
>> Is it really so?
>>
>> As you can see from my measurements [1], in my case simple shallow clone (1)
>> runs faster than simple partial clone (2) and produces slightly smaller .git,
>> from which I can infer that (2) downloads some data which is not downloaded
>> in (1).
> 
> Actually, as I have full git logs for all these cases, there is no need to be guessing:
>     (1) downloads 295085 git objects of total size 1.00 GiB
>     (2) downloads 1949129 git objects of total size 1.01 GiB

It is worth pointing out that these sizes are very close. The number of objects
may be part of why the timing is so different as the client needs to parse all
deltas to verify the object contents.

Re-running the test with GIT_TRACE2_PERF=1 might reveal some interesting info
about which regions are slower than others.
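As a rough sketch of what such a run could look like: GIT_TRACE2_PERF accepts
either 1 (trace to stderr) or an absolute path to a file. The throwaway local
repository and the temp paths below are placeholders chosen only so the sketch
runs offline; substitute the real <url> and <mode> flags when measuring.

```shell
set -e
work=$(mktemp -d)

# Placeholder source repository (hypothetical); in practice this is <url>.
git init -q "$work/src"
echo hello > "$work/src/file.txt"
git -C "$work/src" add file.txt
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
	commit -q -m "initial commit"

# GIT_TRACE2_PERF=1 would print to stderr; an absolute path writes to a file.
trace="$work/trace2-perf.txt"
GIT_TRACE2_PERF="$trace" git clone -q "file://$work/src" "$work/clone"

# Show the first events; each line carries elapsed times for that event.
head -n 20 "$trace"
```

Comparing the trace files from the shallow and partial runs side by side is
one way to see where the extra time goes.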

> Total sizes are very close, but (2) downloads many more objects, and also it uses
> 3 passes to download them which leads to less efficient use of network bandwidth.

Three passes, being:

1. Download commits and trees.
2. Initialize sparse-checkout with blobs at root.
3. Expand sparse-checkout.

Is that right? You could group 1 & 2 by setting your sparse-checkout patterns
before initializing a checkout (if you clone with --no-checkout). Your link
says you did this:

	git clone <mode> --no-checkout <url> <dir>
	git sparse-checkout init
	git sparse-checkout set '/*' '!LayoutTests'

Try doing it this way instead:

	git clone <mode> --no-checkout <url> <dir>
	git config core.sparseCheckout true
	git sparse-checkout set '/*' '!LayoutTests'

By doing it this way, you skip the step where the 'init' subcommand looks
for all blobs at root and does a network call for them. That should remove some
overhead.
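The sequence above can be tried end to end against a small local repository.
This is only a sketch under stated assumptions: the repository contents and
temp paths are made up, <mode> is left out, and two lines are additions that
are not in the recipe above. Those are the explicit core.sparseCheckoutCone
setting (newer Git versions default to cone mode, which does not take these
patterns) and the final checkout (newer Git versions leave the work tree empty
until a branch is checked out).

```shell
set -e
work=$(mktemp -d)

# Placeholder repository standing in for <url>: one root file, one
# directory to keep, and a LayoutTests directory to exclude.
git init -q "$work/src"
mkdir "$work/src/Source" "$work/src/LayoutTests"
echo readme > "$work/src/README"
echo code > "$work/src/Source/main.c"
echo test > "$work/src/LayoutTests/test.html"
git -C "$work/src" add .
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
	commit -q -m "initial commit"
branch=$(git -C "$work/src" symbolic-ref --short HEAD)

git clone -q --no-checkout "file://$work/src" "$work/clone"
git -C "$work/clone" config core.sparseCheckout true
# Assumption: pin non-cone mode so the patterns are read as written.
git -C "$work/clone" config core.sparseCheckoutCone false
git -C "$work/clone" sparse-checkout set '/*' '!LayoutTests'
# Assumption: populate the work tree explicitly; this changes nothing on
# versions where 'set' has already done so.
git -C "$work/clone" checkout -q "$branch"

ls "$work/clone"
```

With a partial clone, the point of this ordering is that blobs are only
requested once the final patterns are known.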

Less efficient use of network bandwidth is one thing, but shallow clones are
also more CPU-intensive with the "counting objects" phase on the server. Your
link shares the following end-to-end timings:

* Shallow-clone: 234s
* Partial clone: 286s
* Both(???): 1023s

The data implies that by asking for both you actually got a full clone (4.1 GB).

The 234s to 286s difference is meaningful. Almost a minute.

>> To be clear, the use case I'm interested in right now is checking out sources in
>> a cloud CI system like GitHub Actions for a one-shot build. Right now checkout usually
>> takes 1-2 minutes and my hope was that someday in the future it would be possible
>> to make it faster.

As long as you delete the shallow clone every time, then you also remove the
downsides of a shallow clone related to a later fetch or attempts to push.

If possible, a repo this size would benefit from persistent build agents that
you control. They can keep a copy of the repo around and do incremental fetches
that are much faster. It's a larger investment to run your own build lab, though.
But sometimes making builds faster is expensive. It depends on how "expensive" those
four-minute clones per build are in terms of your team waiting.
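For illustration, the incremental-fetch pattern on a persistent agent might
look roughly like the following. The local source repository, the "agent"
directory, and the demo_commit helper are all hypothetical stand-ins so the
sketch runs offline; a real agent would clone the real <url> once and repeat
only the last two commands per build.

```shell
set -e
work=$(mktemp -d)

# Hypothetical helper: commit in the placeholder upstream repository.
demo_commit() {
	git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
		commit -q -m "$1"
}

git init -q "$work/src"
echo one > "$work/src/file.txt"
git -C "$work/src" add file.txt
demo_commit "first change"
branch=$(git -C "$work/src" symbolic-ref --short HEAD)

# One-time setup on the agent: a full clone that is kept between builds.
git clone -q "file://$work/src" "$work/agent"

# Upstream moves forward between builds.
echo two >> "$work/src/file.txt"
git -C "$work/src" add file.txt
demo_commit "second change"

# Per-build step: fetch only what is new, then check out the fetched tip.
git -C "$work/agent" fetch -q origin
git -C "$work/agent" checkout -q --detach "origin/$branch"
```

Because the agent already has the earlier history, each fetch only has to
transfer the objects added since the previous build.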

Thanks,
-Stolee
