git@vger.kernel.org mailing list mirror (one of many)
From: Derrick Stolee <derrickstolee@github.com>
To: 程洋 <chengyang@xiaomi.com>, "git@vger.kernel.org" <git@vger.kernel.org>
Cc: 何浩 <hehao@xiaomi.com>, "Xin7 Ma 马鑫" <maxin7@xiaomi.com>,
	石奉兵 <shifengbing@xiaomi.com>, 凡军辉 <fanjunhui@xiaomi.com>,
	王汉基 <wanghanji@xiaomi.com>
Subject: Re: Partial-clone cause big performance impact on server
Date: Fri, 12 Aug 2022 08:21:54 -0400
Message-ID: <16633d89-6ccd-859d-8533-9861ad831c45@github.com>
In-Reply-To: <bfa3de4485614badb4a27d8cfba99968@xiaomi.com>

On 8/11/22 4:09 AM, 程洋 wrote:
> Hi.
>      We observed big disk space savings with partial clone, and we now require all of our users (2000+) to clone repositories with partial clone (filter=blob:none).
>      However, at busy times, fetching is extremely slow for users. Here is what we did.
> 
>     1. We asked all users to fetch with filter=blob:none. The result is remarkable: our download size per user decreased from 460G to 180G.

I hope this includes the blob download during the initial checkout,
because otherwise your repository must have a very strange shape for
commits and trees alone to take up 180 GB.

>     2. But at busy times, everyone's fetch becomes slow. (At idle hours, it takes us 5 minutes to clone a big repository, but it takes more than 1 hour to clone the same repository at busy hours.)
>     3. With GIT_TRACE_PACKET=1, we found that on big repositories (200K+ refs, 6M+ objects), Git sends 40k wants.

You only have six million objects in the repo and yet it is that size? It
must contain some very large blobs.

>     4. We then traced our server (which is Gerrit with JGit). We found that the server is counting objects. We then checked those 40k objects: most of them are blobs rather than commits (which means they're not in the bitmap).

Are you seeing any commits in these requests? If the Git client is asking
for blobs, then they should not be mixed with commit wants. What kind of
operation are you doing to see these mixed wants?

If the request was only blobs, then the server should not need a "Counting
objects" phase. It should jump immediately to preparing the objects (which
will likely require parsing deltas, and that can be expensive). I don't
know if JGit is doing something different, though.
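
One way to answer the question above is to pull the "want" lines out of a
GIT_TRACE_PACKET log and tally the object types. The following is only a
rough sketch, not part of Git or JGit: the function names are hypothetical,
and the regular expression assumes trace lines of the form
"packet:        fetch> want <oid>", which should be checked against the
actual log. The type tally parses the "<oid> <type> <size>" lines printed by
`git cat-file --batch-check`.

```python
import re
from collections import Counter

# Assumed shape of a GIT_TRACE_PACKET line that carries a want, e.g.
#   "12:00:00.000000 pkt-line.c:80   packet:        fetch> want <oid>"
WANT_RE = re.compile(r"packet:\s+\S+> want ([0-9a-f]{40,64})\b")


def wanted_oids(trace_lines):
    """Return the object IDs the client asked for, in order, without duplicates."""
    seen = {}
    for line in trace_lines:
        match = WANT_RE.search(line)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


def tally_types(batch_check_output):
    """Count object types in `git cat-file --batch-check` output.

    Each line is "<oid> <type> <size>", or "<oid> missing" when the
    repository does not have the object.
    """
    counts = Counter()
    for line in batch_check_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts
```

Feeding the result of wanted_oids() to `git cat-file --batch-check` in a
repository that has the objects, and then passing its output to
tally_types(), would show whether commits are mixed in with the blobs.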

>     5. We believe that's the root cause of our problem. Git sends too many "want SHA1" lines for objects that are not in the bitmap, which causes the server to count objects frequently, which in turn slows down the server.
> 
> What we want is to download only what we need to check out a specific commit. But if one commit contains many objects (40k+ in our case), counting takes more time than downloading.

One thing that the microsoft/git fork uses in its "git-gvfs-helper" tool
(which speaks the GVFS Protocol as a replacement for partial clone when
using Azure Repos as a server) is a batched download of missing objects [1].
The initial limit is 4000 objects at a time, but that helps keep each
request small enough that it is less likely to fail for scale reasons alone.

[1] https://github.com/microsoft/git/blob/vfs-2.37.1/gvfs-helper.c#L3510-L3520

It might be interesting to create such batch-downloads for these partial
clone blob-fetches.
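
To make the batching idea concrete, here is a minimal sketch. It is not the
gvfs-helper code and not an existing Git feature; `fetch_batch` is a
hypothetical callable standing in for whatever performs one request, and
the default of 4000 simply mirrors the initial limit mentioned above.

```python
from itertools import islice

BATCH_SIZE = 4000  # initial limit mentioned above; treat it as tunable


def batches(oids, size=BATCH_SIZE):
    """Yield successive lists of at most `size` object IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(oids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fetch_missing(oids, fetch_batch, size=BATCH_SIZE):
    """Call `fetch_batch` (hypothetical) once per batch; return the batch count."""
    count = 0
    for chunk in batches(oids, size):
        fetch_batch(chunk)
        count += 1
    return count
```

With 40k missing blobs, this would issue ten requests of 4000 objects each
instead of a single request with 40k wants.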

Thanks,
-Stolee

Thread overview: 40+ messages
2022-08-11  8:09 Partial-clone cause big performance impact on server 程洋
2022-08-11 17:22 ` Jonathan Tan
2022-08-13  7:55   ` Re: [External Mail]Re: " 程洋
2022-08-13 11:41     ` 程洋
2022-08-15  5:16     ` ZheNing Hu
2022-08-15 13:15       ` 程洋
2022-08-12 12:21 ` Derrick Stolee [this message]
2022-08-14  6:48 ` Jeff King
2022-08-15 13:18   ` Derrick Stolee
2022-08-15 14:50     ` [External Mail]Re: " 程洋
2022-08-17 10:22     ` 程洋
2022-08-17 13:41       ` Derrick Stolee
2022-08-18  5:49         ` Jeff King
2022-09-01  6:53   ` 程洋
2022-09-01 16:19     ` Jeff King
2022-09-05 11:17       ` 程洋
2022-09-06 18:38         ` Jeff King
2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
2022-09-06 23:01             ` [PATCH 1/3] parse_object(): allow skipping hash check Jeff King
2022-09-07 14:15               ` Derrick Stolee
2022-09-07 20:44                 ` Jeff King
2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
2022-09-07 14:36               ` Derrick Stolee
2022-09-07 14:45                 ` Derrick Stolee
2022-09-07 20:50                   ` Jeff King
2022-09-07 19:26               ` Junio C Hamano
2022-09-07 20:36                 ` Jeff King
2022-09-07 20:48                   ` [BUG] t1800: Fails for error text comparison rsbecker
2022-09-07 21:55                     ` Junio C Hamano
2022-09-07 22:23                       ` rsbecker
2022-09-07 21:02                   ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
2022-09-07 22:07                     ` Junio C Hamano
2022-09-08  5:04                       ` Jeff King
2022-09-08 16:41                         ` Junio C Hamano
2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
2022-09-07 14:46               ` Derrick Stolee
2022-09-07 19:31               ` Junio C Hamano
2022-09-08 10:39                 ` [External Mail]Re: " 程洋
2022-09-08 18:42                   ` Jeff King
2022-09-07 14:48             ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Derrick Stolee
