git@vger.kernel.org mailing list mirror (one of many)
* Partial-clone cause big performance impact on server
@ 2022-08-11  8:09 程洋
  2022-08-11 17:22 ` Jonathan Tan
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: 程洋 @ 2022-08-11  8:09 UTC (permalink / raw)
  To: git@vger.kernel.org
  Cc: 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Hi.
     We observed big disk-space savings from partial clone and now require all of our users (2000+) to clone repositories with partial clone (filter=blob:none).
     However, at busy times we found fetching is extremely slow for users. Here is what we did:

    1. We asked all users to fetch with filter=blob:none, and the result is remarkable: download size per user decreased from 460 GB to 180 GB.
    2. But at busy times, everyone's fetch becomes slow. (At idle hours it takes us 5 minutes to clone a big repository, but more than 1 hour to clone the same repository at busy hours.)
    3. With GIT_TRACE_PACKET=1, we found that on big repositories (200k+ refs, 6M+ objects) Git sends 40k "want" lines.
    4. We then traced our server (Gerrit with JGit) and found it was counting objects. Checking those 40k objects, most are blobs rather than commits, which means they are not in the bitmap.
    5. We believe that is the root cause of our problem: Git sends too many "want SHA-1" lines that are not in the bitmap, causing the server to count objects frequently, which slows it down.

What we want is to download only the things needed to check out a specific commit. But when one commit contains that many objects (in our case, 40k+), counting takes more time than downloading.
Is it possible to let Git send only a "commit want" rather than every object SHA-1 one by one?
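The want explosion in step 3 is easy to spot mechanically. A minimal sketch, assuming a simplified trace format (real GIT_TRACE_PACKET output carries timestamps and more fields than shown here):

```python
# Count "want <oid>" packet lines in a GIT_TRACE_PACKET-style log.
# The trace lines below are simplified stand-ins, not real git output.
def count_wants(trace_lines):
    return sum(1 for line in trace_lines if " want " in line)

trace = [
    "packet: fetch> want 1111111111111111111111111111111111111111",
    "packet: fetch> want 2222222222222222222222222222222222222222",
    "packet: fetch> done",
]
print(count_wants(trace))  # → 2
```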
#/******本邮件及其附件含有小米公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件! This e-mail and its attachments contain confidential information from XIAOMI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!******/#

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Partial-clone cause big performance impact on server
  2022-08-11  8:09 Partial-clone cause big performance impact on server 程洋
@ 2022-08-11 17:22 ` Jonathan Tan
  2022-08-13  7:55   ` 回复: [External Mail]Re: " 程洋
  2022-08-12 12:21 ` Derrick Stolee
  2022-08-14  6:48 ` Jeff King
  2 siblings, 1 reply; 40+ messages in thread
From: Jonathan Tan @ 2022-08-11 17:22 UTC (permalink / raw)
  To: 程洋
  Cc: Jonathan Tan, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

程洋 <chengyang@xiaomi.com> writes:
>     3. with GIT_TRACE_PACKET=1. We found on big repositories (200K+refs, 6m+ objects). Git will sends 40k want.
>     4. And we then track our server(which is gerrit with jgit). We found the server is couting objects. Then we check those 40k objects, most of them are blobs rather than commit. (which means they're not in bitmap)
>     5. We believe that's the root cause of our problem. Git sends too many "want SHA1" which are not in bitmap, cause the server to count objects  frequently, which then slow down the server.
> 
> What we want is, download the things we need to checkout to specific commit. But if one commit contain so many objects (like us , 40k+). It takes more time to counting than downloading.
> Is it possible to let git only send "commit want" rather than all the objects SHA1 one by one?

On a technical level, it may be possible - at the point in the Git code
where the batch prefetch occurs, I'm not sure if we have the commit, but
we could plumb the commit information there. (We have the tree, but this
doesn't help us here because as far as I know, the tree won't be in the
bitmap so the server would need to count objects anyway, resulting in
the same problem.)

However, sending only commits as wants would mean that we would be
fetching more blobs than needed. For example, if we were to clone (with
checkout) and then checkout HEAD^, sending a "commit want" for the
latter checkout would result in all blobs referenced by the commit's
tree being fetched and not only the blobs that are different.
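The over-fetch described above can be sketched with sets (the blob IDs are made up; this illustrates the trade-off, not Git's implementation):

```python
# Blobs reachable from HEAD's tree vs. HEAD^'s tree (hypothetical IDs).
blobs_at_head = {"a1", "b2", "c3", "d4"}
blobs_at_prev = {"a1", "b2", "c3", "e5"}   # HEAD^ shares most blobs

# Blob-level wants: fetch only what the client is missing.
blob_level_fetch = blobs_at_prev - blobs_at_head
# Commit-level want: the server sends everything the commit references.
commit_level_fetch = blobs_at_prev

print(sorted(blob_level_fetch))   # → ['e5']
print(len(commit_level_fetch))    # → 4
```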

One idea that we (at $DAYJOB) had is to supply a commit hint so that the
server can first use bitmaps to narrow down the objects that need to be
checked. I had a preliminary patch for that [1] but as of now, no one
has continued pursuing that idea.

[1] https://lore.kernel.org/git/20201215200207.1083655-1-jonathantanmy@google.com/
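The commit-hint idea in [1] can be caricatured like this (all names are illustrative; the real patch plumbs the hint through the wire protocol and the server's pack-bitmap code):

```python
# With a hinted commit, the server can test blob wants against that
# commit's reachability bitmap first, and only fall back to an expensive
# object walk for the leftovers. Plain sets stand in for real bitmaps.
def split_by_bitmap(wants, hinted_bitmap):
    covered  = [w for w in wants if w in hinted_bitmap]
    leftover = [w for w in wants if w not in hinted_bitmap]
    return covered, leftover

bitmap_for_hint = {"blob1", "blob2", "blob3"}
wants = ["blob1", "blob3", "blob9"]
covered, leftover = split_by_bitmap(wants, bitmap_for_hint)
print(covered, leftover)  # → ['blob1', 'blob3'] ['blob9']
```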


* Re: Partial-clone cause big performance impact on server
  2022-08-11  8:09 Partial-clone cause big performance impact on server 程洋
  2022-08-11 17:22 ` Jonathan Tan
@ 2022-08-12 12:21 ` Derrick Stolee
  2022-08-14  6:48 ` Jeff King
  2 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2022-08-12 12:21 UTC (permalink / raw)
  To: 程洋, git@vger.kernel.org
  Cc: 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 8/11/22 4:09 AM, 程洋 wrote:
> Hi.
>      We observed big disk space save by partial-clone and require all of our users (2000+) to clone repository with partial-clone (filter=blob:none)
>      However at busy time, we found it's extremely slow for user to fetch. Here is what we did.
> 
>     1. ask all users to fetch with filter=blob:none. And it's remarkable. Now our download size per user decrease from 460G to 180G.

I hope this includes the blob download during the initial checkout,
because otherwise you have a very strange shape to make your commits and
trees take up 180 GB.

>     2. But at busy time, everyone's fetch become slow. (at idle hours, it takes us 5 minutes to clone a big repositories, but it takes more than 1 hour to clone the same repositories at busy hours)
>     3. with GIT_TRACE_PACKET=1. We found on big repositories (200K+refs, 6m+ objects). Git will sends 40k want.

You only have six million objects in the repo and yet have that size? It
must be some very large blobs.

>     4. And we then track our server(which is gerrit with jgit). We found the server is couting objects. Then we check those 40k objects, most of them are blobs rather than commit. (which means they're not in bitmap)

Are you seeing any commits in these requests? If the Git client is asking
for blobs, then they should not be mixed with commit wants. What kind of
operation are you doing to see these mixed wants?

If the request was only blobs, then the server should not need a "Counting
objects" phase. It should jump immediately to preparing the objects (which
will likely require parsing deltas, and that can be expensive). I don't
know if JGit is doing something different, though.

>     5. We believe that's the root cause of our problem. Git sends too many "want SHA1" which are not in bitmap, cause the server to count objects  frequently, which then slow down the server.
> 
> What we want is, download the things we need to checkout to specific commit. But if one commit contain so many objects (like us , 40k+). It takes more time to counting than downloading.

One thing that the microsoft/git fork uses in its "git-gvfs-helper" tool
(which speaks the GVFS Protocol as a replacement for partial clone when
using Azure Repos as a server) is a batched download of missing objects [1].
The initial limit is 4000 objects at a time, but that helps keep each
request small enough that it is less likely to fail for scale reasons alone.

[1] https://github.com/microsoft/git/blob/vfs-2.37.1/gvfs-helper.c#L3510-L3520

It might be interesting to create such batch-downloads for these partial
clone blob-fetches.
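The batching described above is simple chunking; a sketch (the 4000-object batch size is the only number taken from gvfs-helper, the rest is illustrative):

```python
# Split a large list of missing-object IDs into fixed-size requests, the
# way microsoft/git's gvfs-helper batches its downloads (default 4000).
def batches(oids, size=4000):
    for i in range(0, len(oids), size):
        yield oids[i:i + size]

missing = [f"oid{i}" for i in range(40_000)]   # e.g. the 40k wants above
chunks = list(batches(missing))
print(len(chunks), len(chunks[0]))  # → 10 4000
```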

Thanks,
-Stolee


* 回复: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-11 17:22 ` Jonathan Tan
@ 2022-08-13  7:55   ` 程洋
  2022-08-13 11:41     ` 程洋
  2022-08-15  5:16     ` ZheNing Hu
  0 siblings, 2 replies; 40+ messages in thread
From: 程洋 @ 2022-08-13  7:55 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

> >     3. with GIT_TRACE_PACKET=1. We found on big repositories (200K+refs, 6m+ objects). Git will sends 40k want.
> >     4. And we then track our server(which is gerrit with jgit). We found the server is couting objects. Then we check those 40k objects, most of them are blobs rather than commit. (which means they're not in bitmap)
> >     5. We believe that's the root cause of our problem. Git sends too many "want SHA1" which are not in bitmap, cause the server to count objects frequently, which then slow down the server.
> >
> > What we want is, download the things we need to checkout to specific commit. But if one commit contain so many objects (like us , 40k+). It takes more time to counting than downloading.
> > Is it possible to let git only send "commit want" rather than all the objects SHA1 one by one?
>
> On a technical level, it may be possible - at the point in the Git code where the
> batch prefetch occurs, I'm not sure if we have the commit, but we could
> plumb the commit information there. (We have the tree, but this doesn't help
> us here because as far as I know, the tree won't be in the bitmap so the server
> would need to count objects anyway, resulting in the same problem.)
>
> However, sending only commits as wants would mean that we would be
> fetching more blobs than needed. For example, if we were to clone (with
> checkout) and then checkout HEAD^, sending a "commit want" for the latter
> checkout would result in all blobs referenced by the commit's tree being
> fetched and not only the blobs that are different.

It seems your solution requires changes on both the server side and the client side.
Why not just add another filter that lets partial clone always send commit-level wants?
If we check out HEAD~1, the client can send "want HEAD~1 HEAD~2".

> One idea that we (at $DAYJOB) had is to supply a commit hint so that the
> server can first use bitmaps to narrow down the objects that need to be
> checked. I had a preliminary patch for that [1] but as of now, no one has
> continued pursuing that idea.
>
> [1] https://lore.kernel.org/git/20201215200207.1083655-1-jonathantanmy@google.com/


* 回复: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-13  7:55   ` 回复: [External Mail]Re: " 程洋
@ 2022-08-13 11:41     ` 程洋
  2022-08-15  5:16     ` ZheNing Hu
  1 sibling, 0 replies; 40+ messages in thread
From: 程洋 @ 2022-08-13 11:41 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Another thing I need to point out: the current partial clone also has a big performance impact on disks, especially SMR disks.
Some SMR disks have really bad random-write performance (around 100 KB/s).
Typically it takes us 50 minutes to download the whole Android project with partial clone (2.5 hours without partial clone).
However, it takes us 5 hours to download with partial clone on those SMR disks.

That is also because the large number of wants turns the download process into random writes rather than sequential writes.

> > >     3. with GIT_TRACE_PACKET=1. We found on big repositories (200K+refs, 6m+ objects). Git will sends 40k want.
> > >     4. And we then track our server(which is gerrit with jgit). We found the
> server is couting objects. Then we check those 40k objects, most of them are
> blobs rather than commit. (which means they're not in bitmap)
> > >     5. We believe that's the root cause of our problem. Git sends too many "want SHA1" which are not in bitmap, cause the server to count objects frequently, which then slow down the server.
> > >
> > > What we want is, download the things we need to checkout to specific commit. But if one commit contain so many objects (like us , 40k+). It takes more time to counting than downloading.
> > > Is it possible to let git only send "commit want" rather than all the objects SHA1 one by one?
> >
> > On a technical level, it may be possible - at the point in the Git
> > code where the batch prefetch occurs, I'm not sure if we have the
> > commit, but we could plumb the commit information there. (We have the
> > tree, but this doesn't help us here because as far as I know, the tree
> > won't be in the bitmap so the server would need to count objects
> > anyway, resulting in the same problem.)
> >
> > However, sending only commits as wants would mean that we would be
> > fetching more blobs than needed. For example, if we were to clone
> > (with
> > checkout) and then checkout HEAD^, sending a "commit want" for the
> > latter checkout would result in all blobs referenced by the commit's
> > tree being fetched and not only the blobs that are different.
>
> It seems your solution require changes from both server side and client side
> Why not we just add another filter, allow partial-clone always sends commit
> level want?
> If we checkout HEAD~1, then client can send "want HEAD~1 HEAD~2".
>
> > One idea that we (at $DAYJOB) had is to supply a commit hint so that
> > the server can first use bitmaps to narrow down the objects that need
> > to be checked. I had a preliminary patch for that [1] but as of now,
> > no one has continued pursuing that idea.
> >
> > [1]
> > https://lore.kernel.org/git/20201215200207.1083655-1-jonathantanmy@google.com/


* Re: Partial-clone cause big performance impact on server
  2022-08-11  8:09 Partial-clone cause big performance impact on server 程洋
  2022-08-11 17:22 ` Jonathan Tan
  2022-08-12 12:21 ` Derrick Stolee
@ 2022-08-14  6:48 ` Jeff King
  2022-08-15 13:18   ` Derrick Stolee
  2022-09-01  6:53   ` 程洋
  2 siblings, 2 replies; 40+ messages in thread
From: Jeff King @ 2022-08-14  6:48 UTC (permalink / raw)
  To: 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On Thu, Aug 11, 2022 at 08:09:56AM +0000, 程洋 wrote:

>     4. And we then track our server(which is gerrit with jgit). We
>        found the server is couting objects. Then we check those 40k
>        objects, most of them are blobs rather than commit. (which
>        means they're not in bitmap)
>     5. We believe that's the root cause of our problem. Git sends too
>        many "want SHA1" which are not in bitmap, cause the server to
>        count objects  frequently, which then slow down the server.

I'd be surprised if bitmaps make a big difference either way here, since
blobs are very quick in the "counting" phase of pack-objects. They can't
link to anything else, so we should not be opening the object contents
at all! We just need to find them on disk, and then in many cases we can
send them over the wire without even decompressing (the exception is if
they are stored as deltas against an object the client doesn't have).
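That delta exception boils down to a per-object check along these lines (the data structures are made up; real pack-objects works on pack indexes):

```python
# A packed blob can be copied to the wire without decompression unless it
# is stored as a delta whose base the client does not already have.
def can_stream_as_is(entry, client_has):
    base = entry.get("delta_base")
    return base is None or base in client_has

pack = [
    {"oid": "b1", "delta_base": None},   # full object: stream verbatim
    {"oid": "b2", "delta_base": "b1"},   # base is on the client
    {"oid": "b3", "delta_base": "x9"},   # base missing: must re-encode
]
client = {"b1"}
print([e["oid"] for e in pack if can_stream_as_is(e, client)])  # → ['b1', 'b2']
```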

I didn't generate a test case, but I'm pretty sure that is how git.git's
pack-objects should behave. But you mentioned that the server is jgit;
it's possible that it isn't as optimized in that area.

-Peff


* Re: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-13  7:55   ` 回复: [External Mail]Re: " 程洋
  2022-08-13 11:41     ` 程洋
@ 2022-08-15  5:16     ` ZheNing Hu
  2022-08-15 13:15       ` 程洋
  1 sibling, 1 reply; 40+ messages in thread
From: ZheNing Hu @ 2022-08-15  5:16 UTC (permalink / raw)
  To: 程洋
  Cc: Jonathan Tan, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

程洋 <chengyang@xiaomi.com> 于2022年8月13日周六 16:00写道:
>
> > >     3. with GIT_TRACE_PACKET=1. We found on big repositories (200K+refs, 6m+ objects). Git will sends 40k want.
> > >     4. And we then track our server(which is gerrit with jgit). We found the server is couting objects. Then we check those 40k objects, most of them are blobs rather than commit. (which means they're not in bitmap)
> > >     5. We believe that's the root cause of our problem. Git sends too many "want SHA1" which are not in bitmap, cause the server to count objects frequently, which then slow down the server.
> > >
> > > What we want is, download the things we need to checkout to specific commit. But if one commit contain so many objects (like us , 40k+). It takes more time to counting than downloading.
> > > Is it possible to let git only send "commit want" rather than all the objects SHA1 one by one?
> >
> > On a technical level, it may be possible - at the point in the Git code where the
> > batch prefetch occurs, I'm not sure if we have the commit, but we could
> > plumb the commit information there. (We have the tree, but this doesn't help
> > us here because as far as I know, the tree won't be in the bitmap so the server
> > would need to count objects anyway, resulting in the same problem.)
> >
> > However, sending only commits as wants would mean that we would be
> > fetching more blobs than needed. For example, if we were to clone (with
> > checkout) and then checkout HEAD^, sending a "commit want" for the latter
> > checkout would result in all blobs referenced by the commit's tree being
> > fetched and not only the blobs that are different.
>
> It seems your solution require changes from both server side and client side
> Why not we just add another filter, allow partial-clone always sends commit level want?
> If we checkout HEAD~1, then client can send "want HEAD~1 HEAD~2".
>

I am interested in this question too; maybe I can try to see if we can do
this.. ;-)

ZheNing Hu


* RE: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-15  5:16     ` ZheNing Hu
@ 2022-08-15 13:15       ` 程洋
  0 siblings, 0 replies; 40+ messages in thread
From: 程洋 @ 2022-08-15 13:15 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Jonathan Tan, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

There is a really easy way to reproduce it:

git clone --filter=blob:none -b master "https://android.googlesource.com/platform/prebuilts/gradle-plugin"

Even Google's AOSP Gerrit has this problem. You will see it hang for minutes on checkout.


> -----Original Message-----
> From: ZheNing Hu <adlternative@gmail.com>
> Sent: Monday, August 15, 2022 1:16 PM
> To: 程洋 <chengyang@xiaomi.com>
> Cc: Jonathan Tan <jonathantanmy@google.com>; git@vger.kernel.org; 何浩
> <hehao@xiaomi.com>; Xin7 Ma 马鑫 <maxin7@xiaomi.com>; 石奉兵
> <shifengbing@xiaomi.com>; 凡军辉 <fanjunhui@xiaomi.com>; 王汉基
> <wanghanji@xiaomi.com>
> Subject: Re: [External Mail]Re: Partial-clone cause big performance impact on
> server
>
> [External Mail] This e-mail originates from outside Xiaomi. Please handle with caution.
>
> 程洋 <chengyang@xiaomi.com> 于2022年8月13日周六 16:00写道:
> >
> > > >     3. with GIT_TRACE_PACKET=1. We found on big repositories
> (200K+refs, 6m+ objects). Git will sends 40k want.
> > > >     4. And we then track our server(which is gerrit with jgit). We found
> the server is couting objects. Then we check those 40k objects, most of them
> are blobs rather than commit. (which means they're not in bitmap)
> > > >     5. We believe that's the root cause of our problem. Git sends too
> many "want SHA1" which are not in bitmap, cause the server to count
> objects frequently, which then slow down the server.
> > > >
> > > > What we want is, download the things we need to checkout to specific
> commit. But if one commit contain so many objects (like us , 40k+). It takes
> more time to counting than downloading.
> > > > Is it possible to let git only send "commit want" rather than all the
> objects SHA1 one by one?
> > >
> > > On a technical level, it may be possible - at the point in the Git
> > > code where the batch prefetch occurs, I'm not sure if we have the
> > > commit, but we could plumb the commit information there. (We have
> > > the tree, but this doesn't help us here because as far as I know,
> > > the tree won't be in the bitmap so the server would need to count
> > > objects anyway, resulting in the same problem.)
> > >
> > > However, sending only commits as wants would mean that we would be
> > > fetching more blobs than needed. For example, if we were to clone
> > > (with
> > > checkout) and then checkout HEAD^, sending a "commit want" for the
> > > latter checkout would result in all blobs referenced by the commit's
> > > tree being fetched and not only the blobs that are different.
> >
> > It seems your solution require changes from both server side and
> > client side Why not we just add another filter, allow partial-clone always
> sends commit level want?
> > If we checkout HEAD~1, then client can send "want HEAD~1 HEAD~2".
> >
>
> I am interesting about this question too, maybe I can try if we can do this.. ;-)
>
> ZheNing Hu


* Re: Partial-clone cause big performance impact on server
  2022-08-14  6:48 ` Jeff King
@ 2022-08-15 13:18   ` Derrick Stolee
  2022-08-15 14:50     ` [External Mail]Re: " 程洋
  2022-08-17 10:22     ` 程洋
  2022-09-01  6:53   ` 程洋
  1 sibling, 2 replies; 40+ messages in thread
From: Derrick Stolee @ 2022-08-15 13:18 UTC (permalink / raw)
  To: Jeff King, 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 8/14/2022 2:48 AM, Jeff King wrote:
> On Thu, Aug 11, 2022 at 08:09:56AM +0000, 程洋 wrote:
> 
>>     4. And we then track our server(which is gerrit with jgit). We
>>        found the server is couting objects. Then we check those 40k
>>        objects, most of them are blobs rather than commit. (which
>>        means they're not in bitmap)
>>     5. We believe that's the root cause of our problem. Git sends too
>>        many "want SHA1" which are not in bitmap, cause the server to
>>        count objects  frequently, which then slow down the server.
> 
> I'd be surprised if bitmaps make a big difference either way here, since
> blobs are very quick in the "counting" phase of pack-objects. They can't
> link to anything else, so we should not be opening the object contents
> at all! We just need to find them on disk, and then in many cases we can
> send them over the wire without even decompressing (the exception is if
> they are stored as deltas against an object the client doesn't have).
> 
> I didn't generate a test case, but I'm pretty sure that is how git.git's
> pack-objects should behave. But you mentioned that the server is jgit;
> it's possible that it isn't as optimized in that area.

I just remembered that Gerrit specifically has branch-level security,
where some branches are not visible to all users. For that reason, blobs
cannot be served without first determining if they are reachable from a
branch visible to the current user.

I'm not sure if that's the problem in this particular case, but it could
be.
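The reachability cost of branch-level ACLs can be sketched as follows (the graph and ACL are invented; Gerrit's actual check walks the object graph):

```python
# Before serving a blob want, the server must confirm the blob is
# reachable from at least one ref the requesting user is allowed to see.
reachable_from = {                      # ref -> blobs reachable from it
    "refs/heads/main":   {"b1", "b2"},
    "refs/heads/secret": {"b3"},
}
visible_refs = {"refs/heads/main"}      # this user's ACL

def may_serve(blob):
    return any(blob in reachable_from[r] for r in visible_refs)

print(may_serve("b2"), may_serve("b3"))  # → True False
```

Doing this walk for 40k blob wants at once is exactly the kind of work that shows up as a long "counting" phase.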

Thanks,
-Stolee



* RE: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-15 13:18   ` Derrick Stolee
@ 2022-08-15 14:50     ` 程洋
  2022-08-17 10:22     ` 程洋
  1 sibling, 0 replies; 40+ messages in thread
From: 程洋 @ 2022-08-15 14:50 UTC (permalink / raw)
  To: Derrick Stolee, Jeff King
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

https://gerrit.googlesource.com/jgit/+/refs/heads/master/org.eclipse.jgit/src/org/eclipse/jgit/transport/UploadPack.java#2037
It seems you're right.
According to a comment I found in the JGit source code, it is indeed doing what you said!

> On 8/14/2022 2:48 AM, Jeff King wrote:
> > On Thu, Aug 11, 2022 at 08:09:56AM +0000, 程洋 wrote:
> >
> >>     4. And we then track our server(which is gerrit with jgit). We
> >>        found the server is couting objects. Then we check those 40k
> >>        objects, most of them are blobs rather than commit. (which
> >>        means they're not in bitmap)
> >>     5. We believe that's the root cause of our problem. Git sends too
> >>        many "want SHA1" which are not in bitmap, cause the server to
> >>        count objects  frequently, which then slow down the server.
> >
> > I'd be surprised if bitmaps make a big difference either way here,
> > since blobs are very quick in the "counting" phase of pack-objects.
> > They can't link to anything else, so we should not be opening the
> > object contents at all! We just need to find them on disk, and then in
> > many cases we can send them over the wire without even decompressing
> > (the exception is if they are stored as deltas against an object the client
> doesn't have).
> >
> > I didn't generate a test case, but I'm pretty sure that is how
> > git.git's pack-objects should behave. But you mentioned that the
> > server is jgit; it's possible that it isn't as optimized in that area.
>
> I just remembered that Gerrit specifically has branch-level security, where
> some branches are not visible to all users. For that reason, blobs cannot be
> served without first determining if they are reachable from a branch visible
> to the current user.
>
> I'm not sure if that's the problem in this particular case, but it could be.
>
> Thanks,
> -Stolee



* RE: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-15 13:18   ` Derrick Stolee
  2022-08-15 14:50     ` [External Mail]Re: " 程洋
@ 2022-08-17 10:22     ` 程洋
  2022-08-17 13:41       ` Derrick Stolee
  1 sibling, 1 reply; 40+ messages in thread
From: 程洋 @ 2022-08-17 10:22 UTC (permalink / raw)
  To: Derrick Stolee, Jeff King
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

But I still think the protocol should tell the server from which ref the blob is reachable;
otherwise it is really hard to implement any kind of ACL.
That said, Git is surely designed for the open-source community, so it makes sense that this request will be rejected.

> -----Original Message-----
> From: Derrick Stolee <derrickstolee@github.com>
> Sent: Monday, August 15, 2022 9:19 PM
> To: Jeff King <peff@peff.net>; 程洋 <chengyang@xiaomi.com>
> Cc: git@vger.kernel.org; 何浩 <hehao@xiaomi.com>; Xin7 Ma 马鑫
> <maxin7@xiaomi.com>; 石奉兵 <shifengbing@xiaomi.com>; 凡军辉
> <fanjunhui@xiaomi.com>; 王汉基 <wanghanji@xiaomi.com>
> Subject: [External Mail]Re: Partial-clone cause big performance impact on
> server
>
> [External Mail] This e-mail originates from outside Xiaomi. Please handle with caution.
>
> On 8/14/2022 2:48 AM, Jeff King wrote:
> > On Thu, Aug 11, 2022 at 08:09:56AM +0000, 程洋 wrote:
> >
> >>     4. And we then track our server(which is gerrit with jgit). We
> >>        found the server is couting objects. Then we check those 40k
> >>        objects, most of them are blobs rather than commit. (which
> >>        means they're not in bitmap)
> >>     5. We believe that's the root cause of our problem. Git sends too
> >>        many "want SHA1" which are not in bitmap, cause the server to
> >>        count objects  frequently, which then slow down the server.
> >
> > I'd be surprised if bitmaps make a big difference either way here,
> > since blobs are very quick in the "counting" phase of pack-objects.
> > They can't link to anything else, so we should not be opening the
> > object contents at all! We just need to find them on disk, and then in
> > many cases we can send them over the wire without even decompressing
> > (the exception is if they are stored as deltas against an object the client
> doesn't have).
> >
> > I didn't generate a test case, but I'm pretty sure that is how
> > git.git's pack-objects should behave. But you mentioned that the
> > server is jgit; it's possible that it isn't as optimized in that area.
>
> I just remembered that Gerrit specifically has branch-level security, where
> some branches are not visible to all users. For that reason, blobs cannot be
> served without first determining if they are reachable from a branch visible
> to the current user.
>
> I'm not sure if that's the problem in this particular case, but it could be.
>
> Thanks,
> -Stolee



* Re: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-17 10:22     ` 程洋
@ 2022-08-17 13:41       ` Derrick Stolee
  2022-08-18  5:49         ` Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee @ 2022-08-17 13:41 UTC (permalink / raw)
  To: 程洋, Jeff King
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 8/17/2022 6:22 AM, 程洋 wrote:
> But I still think the protocol still should tell the server which ref
> the blob is reachable.
> Because it would be really hard to implement any kind of ACL

I think this idea has merit on its face, but it wouldn't really solve the
problem since the reachability query would still need to be done, just
from a smaller set of references at first. If we were able to say "this
blob can be found at path X at commit Y" then the server could do a
commit-reachability query and a path traversal, which should be a lot
faster.

However, it would be extremely difficult to plumb into the partial clone
machinery. At the point where Git realizes it is missing a promisor
object, that code is very generic and removed from any kind of walk from a
reference. That is further complicated by the fact that the walk is
probably from a local reference, which can be entirely different from the
remote reference.

> But Git is surely designed for the open-source community, so it makes
> sense that this request will be rejected.

We try to keep all kinds of users in mind, so the fact that this applies
to not-completely-open repositories is not a blocker.

One possible hurdle is the fact that this branch-level security is a
feature of Gerrit, not a feature of Git itself. Optimizing Git to that
special case that Git does not itself support is less valuable to the Git
project itself.

My personal take is that the technical complexity required to make this
faster paired with the limited scope means that this feature would have a
difficult time getting accepted into the Git project. Perhaps a motivated
contributor will find ways to overcome these obstacles and find other
interesting applications that benefit a larger portion of Git users.

That's just my expectation. I'd be happy to read any patches that try to
solve this problem.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-17 13:41       ` Derrick Stolee
@ 2022-08-18  5:49         ` Jeff King
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff King @ 2022-08-18  5:49 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: 程洋, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

On Wed, Aug 17, 2022 at 09:41:10AM -0400, Derrick Stolee wrote:

> On 8/17/2022 6:22 AM, 程洋 wrote:
> > But I still think the protocol should tell the server which ref the
> > blob is reachable from, because otherwise it would be really hard to
> > implement any kind of ACL.
> 
> I think this idea has merit on its face, but it wouldn't really solve the
> problem since the reachability query would still need to be done, just
> from a smaller set of references at first. If we were able to say "this
> blob can be found at path X at commit Y" then the server could do a
> commit-reachability query and a path traversal, which should be a lot
> faster.
> 
> However, it would be extremely difficult to plumb into the partial clone
> machinery. At the point where Git realizes it is missing a promisor
> object, that code is very generic and removed from any kind of walk from a
> reference. That is further complicated by the fact that the walk is
> probably from a local reference, which can be entirely different from the
> remote reference.

Agreed. The client often doesn't know the context of what it's asking
for in the first place. Sometimes it's not carried through the code, but
we also have commands that might not be invoked with a commit in the
first place! It's valid to run "git read-tree <tree>", and we should be
able to fault in blobs from that tree as needed.
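
That point is easy to see in isolation; a minimal sketch against a
throwaway repository (all paths and file names below are made up for
illustration):

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
echo hello >"$repo/file"
git -C "$repo" add file
git -C "$repo" -c user.name=t -c user.email=t@example.com commit -q -m init

# Resolve the tree object behind HEAD and read it into the index
# directly; no commit object is involved in the read-tree call.
tree=$(git -C "$repo" rev-parse 'HEAD^{tree}')
git -C "$repo" read-tree "$tree"
git -C "$repo" ls-files
```

In a partial clone, a blob named by that tree would be faulted in on
demand when the index content is actually written out.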

I also think that this kind of "is the blob reachable" query is
mostly expensive if you don't have reachability bitmaps at all. If you
do, then the cost to ask "is this object reachable" is the same for a
commit or a blob. If the server has a bitmap of all objects reachable
for each branch ACL (even if it has to do some small bit of fill-in
walking to bring it up to date), then querying for any object type the
client asks for is still just a bit lookup.
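
That bit-lookup behavior can be seen in miniature with stock Git; the
sketch below (throwaway repository, made-up file names) builds a bitmap
with `git repack --write-bitmap-index` and then enumerates reachable
objects from it:

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
echo one >"$repo/a"
git -C "$repo" add a
git -C "$repo" -c user.name=t -c user.email=t@example.com commit -q -m one

# Write a single pack plus its reachability bitmap, as a server would.
git -C "$repo" repack -a -d --write-bitmap-index

# With the bitmap present, enumerating every object reachable from
# HEAD (one commit, one tree, one blob here) can be answered from the
# bitmap rather than a full object walk.
git -C "$repo" rev-list --objects --use-bitmap-index HEAD | wc -l
```

On a real server the same query shape covers "is this blob reachable
from these branches", which is the expensive part without bitmaps.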

Not knowing a lot about gerrit or jgit, it's not clear to me if there
are configuration knobs that could be tweaked on the server side to make
these requests more efficient.

> One possible hurdle is the fact that this branch-level security is a
> feature of Gerrit, not a feature of Git itself. Optimizing Git to that
> special case that Git does not itself support is less valuable to the Git
> project itself.

We don't have branch-level security per se, but I do think that
everything is there in Git to do fast "is this object reachable from
these branches" queries. If you're making a lot of those queries it
might influence your decision of which bitmaps to generate, but the
bitmap concept itself should be sufficient.

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-08-14  6:48 ` Jeff King
  2022-08-15 13:18   ` Derrick Stolee
@ 2022-09-01  6:53   ` 程洋
  2022-09-01 16:19     ` Jeff King
  1 sibling, 1 reply; 40+ messages in thread
From: 程洋 @ 2022-09-01  6:53 UTC (permalink / raw)
  To: Jeff King
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

>
> >     4. And we then tracked our server (which is Gerrit with JGit). We
> >        found the server is counting objects. Then we checked those 40k
> >        objects; most of them are blobs rather than commits (which
> >        means they're not in the bitmap).
> >     5. We believe that's the root cause of our problem. Git sends too
> >        many "want SHA1" lines which are not in the bitmap, causing the
> >        server to count objects frequently, which then slows down the
> >        server.
>
> I'd be surprised if bitmaps make a big difference either way here, since blobs
> are very quick in the "counting" phase of pack-objects. They can't link to
> anything else, so we should not be opening the object contents at all! We
> just need to find them on disk, and then in many cases we can send them
> over the wire without even decompressing (the exception is if they are
> stored as deltas against an object the client doesn't have).
>
> I didn't generate a test case, but I'm pretty sure that is how git.git's pack-
> objects should behave. But you mentioned that the server is jgit; it's possible
> that it isn't as optimized in that area.

At first I also thought it was an implementation bug in JGit. However, I can also reproduce it with C Git. Here are the steps; I'm not sure whether you can reproduce it too.

1. Clone a repository from AOSP to the local machine:  `git clone "https://android.googlesource.com/platform/prebuilts/gradle-plugin"`
2. Try to clone from localhost using a C Git server:  `GIT_TRACE_PACKET=1 git clone --filter=blob:none -b master user@localhost:/home/user/repositories/gradle-plugin`
3. During the checkout phase, it takes 15 seconds before the actual download starts.

It's really easy to reproduce, as long as the repository has many blobs to check out.

>
> -Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-09-01  6:53   ` 程洋
@ 2022-09-01 16:19     ` Jeff King
  2022-09-05 11:17       ` 程洋
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff King @ 2022-09-01 16:19 UTC (permalink / raw)
  To: 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On Thu, Sep 01, 2022 at 06:53:01AM +0000, 程洋 wrote:

> At first I also think it's some implementation bugs by jgit. However I
> can also reproduce it on cgit. Here is the steps, I'm not sure if you
> can reproduce too.
> 
> 1. Clone a repository from AOSP to local machine:  `git clone
>    "https://android.googlesource.com/platform/prebuilts/gradle-plugin"`
> 2. try to clone from localhost using cgit server.
>    `GIT_TRACE_PACKET=1 git clone --filter=blob:none -b master
>    user@localhost:/home/user/repositories/gradle-plugin `
> 3. During checkout phase, it also takes 15 seconds before actual downloading.

I don't see that at all. A few things on your reproduction:

  - you have to tell the local server repo that filters are OK:

      git -C gradle-plugin config uploadpack.allowfilter true

  - your example goes over localhost ssh. Is your server configured to
    allow passing the GIT_PROTOCOL environment variable? If not, you're
    using the v0 protocol. In which case you'll have to set a config
    option to allow clients to fetch objects that the server didn't
    advertise.

    If you do it with allowReachableSHA1InWant, like this:

      git -C gradle-plugin config uploadpack.allowReachableSHA1InWant true

    then there will be a short delay while it checks their
    reachability. That check happens via an external rev-list. I think
    it's not clever enough to use bitmaps, though it could. However, in
    this example, the delay only seems to be around 800ms for me (and of
    course we didn't generate bitmaps anyway, so that wouldn't matter).

    If you instead do:

      git -C gradle-plugin config uploadpack.allowAnySHA1InWant true

    then that reachability check goes away.

    But on modern servers, most of this should be moot anyway. A
    well-configured server should support protocol v2, which defaults to
    allowAnySHA1InWant.

    If you use --no-local to disable local-clone optimizations, then you
    can use --filter without having to go through ssh. That should use
    protocol version 2, as a "real" server would.

So all told, I think a more realistic reproduction is:

  $ git clone https://android.googlesource.com/platform/prebuilts/gradle-plugin
  $ git -C gradle-plugin config uploadpack.allowfilter true
  $ git clone --filter=blob:none --no-local -b master gradle-plugin foo

which takes ~3s for me.
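
The same reproduction shape works against any small throwaway
repository, without cloning from AOSP first (all paths below are
placeholders):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/src"
echo data >"$tmp/src/f"
git -C "$tmp/src" add f
git -C "$tmp/src" -c user.name=t -c user.email=t@example.com commit -q -m c

# The "server" side must opt in to filtered fetches.
git -C "$tmp/src" config uploadpack.allowfilter true

# --no-local forces the normal fetch machinery (and protocol v2), as a
# real server would use; the checkout then faults in the missing blob.
git clone -q --filter=blob:none --no-local "$tmp/src" "$tmp/dst"

# The result is a promisor (partial) clone.
git -C "$tmp/dst" config remote.origin.promisor
```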

I do think upload-pack spends more time than it needs in this case, as
it's keen to call parse_object() on oids that the client asks for. Which
means we'll open up those blobs and check their sha1s before sending
them out, which isn't strictly necessary.

All of this seems orthogonal to the original claim that "Counting
objects" was taking a while, though. The delays here are all inside
upload-pack, before it spawns pack-objects. And it's pack-objects that
says "Counting objects".

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-09-01 16:19     ` Jeff King
@ 2022-09-05 11:17       ` 程洋
  2022-09-06 18:38         ` Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: 程洋 @ 2022-09-05 11:17 UTC (permalink / raw)
  To: Jeff King
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Sorry, I told you the wrong branch. It should be "android-t-preview-1"
git clone --filter=blob:none --no-local -b android-t-preview-1 gradle-plugin

Can you try this one?

> > At first I also think it's some implementation bugs by jgit. However I
> > can also reproduce it on cgit. Here is the steps, I'm not sure if you
> > can reproduce too.
> >
> > 1. Clone a repository from AOSP to local machine:  `git clone
> >
> > "https://android.googlesource.com/platform/prebuilts/gradle-plugin"`
> > 2. try to clone from localhost using cgit server.
> >    `GIT_TRACE_PACKET=1 git clone --filter=blob:none -b master
> >    user@localhost:/home/user/repositories/gradle-plugin ` 3. During
> > checkout phase, it also takes 15 seconds before actual downloading.
>
> I don't see that at all. A few things on your reproduction:
>
>   - you have to tell the local server repo that filters are OK:
>
>       git -C gradle-plugin config uploadpack.allowfilter true
>
>   - your example goes over localhost ssh. Is your server configured to
>     allow passing the GIT_PROTOCOL environment variable? If not, you're
>     using the v0 protocol. In which case you'll have to set a config
>     option to allow clients to fetch objects that the server didn't
>     advertise.
>
>     If you do it with allowReachableSHA1InWant, like this:
>
>       git -C gradle-plugin config uploadpack.allowReachableSHA1InWant true
>
>     then there will be a short delay while it checks their
>     reachability. That check happens via an external rev-list. I think
>     it's not clever enough to use bitmaps, though it could. However, in
>     this example, the delay only seems to be around 800ms for me (and of
>     course we didn't generate bitmaps anyway, so that wouldn't matter).
>
>     If you instead do:
>
>       git -C gradle-plugin config uploadpack.allowAnySHA1InWant true
>
>     then that reachability check goes away.
>
>     But on modern servers, most of this should be moot anyway. A
>     well-configured server should support protocol v2, which defaults to
>     allowAnySHA1InWant.
>
>     If you use --no-local to disable local-clone optimizations, then you
>     can use --filter without having to go through ssh. That should use
>     protocol version 2, as a "real" server would.
>
> So all told, I think a more realistic reproduction is:
>
>   $ git clone https://android.googlesource.com/platform/prebuilts/gradle-
> plugin
>   $ git -C gradle-plugin config uploadpack.allowfilter true
>   $ git clone --filter=blob:none --no-local -b master grade-plugin foo
>
> which takes ~3s for me.
>
> I do think upload-pack spends more time than it needs in this case, as it's
> keen to call parse_object() on oids that the client asks for. Which means we'll
> open up those blobs and check their sha1s before sending them out, which
> isn't strictly necessary.
>
> All of this seems orthogonal to the original claim that "Counting objects" was
> taking a while, though. The delays here are all inside upload-pack, before it
> spawns pack-objects. And it's pack-objects that says "Counting objects".
>
> -Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [External Mail]Re: Partial-clone cause big performance impact on server
  2022-09-05 11:17       ` 程洋
@ 2022-09-06 18:38         ` Jeff King
  2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff King @ 2022-09-06 18:38 UTC (permalink / raw)
  To: 程洋
  Cc: Derrick Stolee, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

On Mon, Sep 05, 2022 at 11:17:21AM +0000, 程洋 wrote:

> Sorry, I told you the wrong branch. It should be "android-t-preview-1"
> git clone --filter=blob:none --no-local -b android-t-preview-1 grade-plugin
> 
> Can you try this one?

Yes, I see more slow-down there. There are many more blobs there, but I
don't think it's really the number of them, but their sizes.

The problem is that both upload-pack and pack-objects are keen to call
parse_object() on their inputs. For commits, etc, that is usually
sensible; we have to parse the object to see what it points to. But for
blobs, the only thing we do is inflate a ton of bytes in order to check
the sha1. That's not really productive here; if there is a bit
corruption, the client will notice it on the receiving side.

So doing this:

diff --git a/object.c b/object.c
index 588b8156f1..6fbf6b2a7e 100644
--- a/object.c
+++ b/object.c
@@ -279,10 +279,6 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (stream_object_signature(r, repl) < 0) {
-			error(_("hash mismatch %s"), oid_to_hex(oid));
-			return NULL;
-		}
 		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
 		return lookup_object(r, oid);
 	}

makes the follow-up fetch very fast. But there are some parts of the
code that rely on this parsing to do the on-disk checks. So I think we'd
probably need to take a flag, and plumb it through from the appropriate
spots.

A sample patch is below, which makes things fast for the example I
showed (it might not for yours, because if you're using the v0 protocol
over ssh, I suspect a different parse_object() call might need to be
tweaked). I think it should be fairly safe, because in both callsites
we're already using the commit-graph to avoid checking the sha1 for
commits (actually, I think this may be an existing subtle breakage in
"rev-list --verify-objects", but that should be fixed separately).

---
diff --git a/object.c b/object.c
index 588b8156f1..97324bc406 100644
--- a/object.c
+++ b/object.c
@@ -263,7 +263,9 @@ struct object *parse_object_or_die(const struct object_id *oid,
 	die(_("unable to parse object: %s"), name ? name : oid_to_hex(oid));
 }
 
-struct object *parse_object(struct repository *r, const struct object_id *oid)
+struct object *parse_object_with_flags(struct repository *r,
+				       const struct object_id *oid,
+				       enum parse_object_flags flags)
 {
 	unsigned long size;
 	enum object_type type;
@@ -279,7 +281,8 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (stream_object_signature(r, repl) < 0) {
+		if (!(flags & PARSE_OBJECT_SKIP_HASH_CHECK) &&
+		    stream_object_signature(r, repl) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
@@ -289,7 +292,8 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 
 	buffer = repo_read_object_file(r, oid, &type, &size);
 	if (buffer) {
-		if (check_object_signature(r, repl, buffer, size, type) < 0) {
+		if (!(flags & PARSE_OBJECT_SKIP_HASH_CHECK) &&
+		    check_object_signature(r, repl, buffer, size, type) < 0) {
 			free(buffer);
 			error(_("hash mismatch %s"), oid_to_hex(repl));
 			return NULL;
@@ -304,6 +308,11 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	return NULL;
 }
 
+struct object *parse_object(struct repository *r, const struct object_id *oid)
+{
+	return parse_object_with_flags(r, oid, 0);
+}
+
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p)
 {
diff --git a/object.h b/object.h
index 9caef89f1f..cd5b7d6306 100644
--- a/object.h
+++ b/object.h
@@ -137,6 +137,13 @@ struct object *parse_object(struct repository *r, const struct object_id *oid);
  */
 struct object *parse_object_or_die(const struct object_id *oid, const char *name);
 
+enum parse_object_flags {
+	PARSE_OBJECT_SKIP_HASH_CHECK = 1 << 0,
+};
+struct object *parse_object_with_flags(struct repository *r,
+				       const struct object_id *oid,
+				       enum parse_object_flags flags);
+
 /* Given the result of read_sha1_file(), returns the object after
  * parsing it.  eaten_p indicates if the object has a borrowed copy
  * of buffer and the caller should not free() it.
diff --git a/revision.c b/revision.c
index ee702e498a..8cbffc0325 100644
--- a/revision.c
+++ b/revision.c
@@ -384,7 +384,8 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 	if (commit)
 		object = &commit->object;
 	else
-		object = parse_object(revs->repo, oid);
+		object = parse_object_with_flags(revs->repo, oid,
+						 PARSE_OBJECT_SKIP_HASH_CHECK);
 
 	if (!object) {
 		if (revs->ignore_missing)
diff --git a/upload-pack.c b/upload-pack.c
index b217a1f469..4bacdf087c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1420,7 +1420,8 @@ static int parse_want(struct packet_writer *writer, const char *line,
 		if (commit)
 			o = &commit->object;
 		else
-			o = parse_object(the_repository, &oid);
+			o = parse_object_with_flags(the_repository, &oid,
+						    PARSE_OBJECT_SKIP_HASH_CHECK);
 
 		if (!o) {
 			packet_writer_error(writer,

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone
  2022-09-06 18:38         ` Jeff King
@ 2022-09-06 22:58           ` Jeff King
  2022-09-06 23:01             ` [PATCH 1/3] parse_object(): allow skipping hash check Jeff King
                               ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Jeff King @ 2022-09-06 22:58 UTC (permalink / raw)
  To: 程洋
  Cc: Derrick Stolee, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

On Tue, Sep 06, 2022 at 02:38:41PM -0400, Jeff King wrote:

> On Mon, Sep 05, 2022 at 11:17:21AM +0000, 程洋 wrote:
> 
> > Sorry, I told you the wrong branch. It should be "android-t-preview-1"
> > git clone --filter=blob:none --no-local -b android-t-preview-1 grade-plugin
> > 
> > Can you try this one?
> 
> Yes, I see more slow-down there. There are many more blobs there, but I
> don't think it's really the number of them, but their sizes.
> 
> The problem is that both upload-pack and pack-objects are keen to call
> parse_object() on their inputs. For commits, etc, that is usually
> sensible; we have to parse the object to see what it points to. But for
> blobs, the only thing we do is inflate a ton of bytes in order to check
> the sha1. That's not really productive here; if there is a bit
> corruption, the client will notice it on the receiving side.

So here's a cleaned-up series which makes this a lot faster.

The special sauce is in patch 2, along with timings. The first one is
just preparing, and the final one is a small cleanup it enables.

I based these directly on the current tip of master. There will be a
small conflict in the test script when merging with the "rev-list
--verify-objects" series I just sent in:

  https://lore.kernel.org/git/Yxe0k++LA%2FUfFLF%2F@coredump.intra.peff.net/

The resolution is to just keep the tests added by both sides.

  [1/3]: parse_object(): allow skipping hash check
  [2/3]: upload-pack: skip parse-object re-hashing of "want" objects
  [3/3]: parse_object(): check commit-graph when skip_hash set

 object.c        | 21 ++++++++++++++++++---
 object.h        |  6 ++++++
 revision.c      | 14 +++-----------
 t/t1450-fsck.sh | 20 ++++++++++++++++++++
 upload-pack.c   |  8 ++------
 5 files changed, 49 insertions(+), 20 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/3] parse_object(): allow skipping hash check
  2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
@ 2022-09-06 23:01             ` Jeff King
  2022-09-07 14:15               ` Derrick Stolee
  2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Jeff King @ 2022-09-06 23:01 UTC (permalink / raw)
  To: 程洋
  Cc: Derrick Stolee, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

The parse_object() function checks the object hash of any object it
parses. This is a nice feature, as it means we may catch bit corruption
during normal use, rather than waiting for specific fsck operations.

But it also can be slow. It's particularly noticeable for blobs, where
except for the hash check, we could return without loading the object
contents at all. Now one may wonder what is the point of calling
parse_object() on a blob in the first place then, but usually it's not
intentional: we were fed an oid from somewhere, don't know the type, and
want an object struct. For commits and trees, the parsing is usually
helpful; we're about to look at the contents anyway. But this is less
true for blobs, where we may be collecting them as part of a
reachability traversal, etc, and don't actually care what's in them. And
blobs, of course, tend to be larger.

We don't want to just throw out the hash-checks for blobs, though. We do
depend on them in some circumstances (e.g., rev-list --verify-objects
uses parse_object() to check them). It's only the callers that know
how they're going to use the result. And so we can help them by
providing a special flag to skip the hash check.

We could just apply this to blobs, as they're going to be the main
source of performance improvement. But if a caller doesn't care about
checking the hash, we might as well skip it for other object types, too.
Even though we can't avoid reading the object contents, we can still
skip the actual hash computation.

If this seems like it is making Git a little bit less safe against
corruption, it may be. But it's part of a series of tradeoffs we're
already making. For instance, "rev-list --objects" does not open the
contents of blobs it prints. And when a commit graph is present, we skip
opening most commits entirely. The important thing will be to use this
flag in cases where it's safe to skip the check. For instance, when
serving a pack for a fetch, we know the client will fully index the
objects and do a connectivity check itself. There's little to be gained
from the server side re-hashing a blob itself. And indeed, most of the
time we don't! The revision machinery won't open up a blob reached by
traversal, but only one requested directly with a "want" line. So
applied properly, this new feature shouldn't make anything less safe in
practice.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm sorry, I know the argument here is really hand-wavy. But I really
think this isn't making anything much less safe.

I was actually tempted to rip out the blob hash-check entirely by
default!  Anybody who really cares about checking the bits can do so
with read_object_file(). That's what fsck does, and we could pretty
easily convert "rev-list --verify-objects" to do so, too. So this is the
less extreme version of the patch. ;)

 object.c | 15 ++++++++++++---
 object.h |  6 ++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/object.c b/object.c
index 588b8156f1..8f6de09078 100644
--- a/object.c
+++ b/object.c
@@ -263,8 +263,11 @@ struct object *parse_object_or_die(const struct object_id *oid,
 	die(_("unable to parse object: %s"), name ? name : oid_to_hex(oid));
 }
 
-struct object *parse_object(struct repository *r, const struct object_id *oid)
+struct object *parse_object_with_flags(struct repository *r,
+				       const struct object_id *oid,
+				       enum parse_object_flags flags)
 {
+	int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK);
 	unsigned long size;
 	enum object_type type;
 	int eaten;
@@ -279,7 +282,7 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (stream_object_signature(r, repl) < 0) {
+		if (!skip_hash && stream_object_signature(r, repl) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
@@ -289,7 +292,8 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 
 	buffer = repo_read_object_file(r, oid, &type, &size);
 	if (buffer) {
-		if (check_object_signature(r, repl, buffer, size, type) < 0) {
+		if (!skip_hash &&
+		    check_object_signature(r, repl, buffer, size, type) < 0) {
 			free(buffer);
 			error(_("hash mismatch %s"), oid_to_hex(repl));
 			return NULL;
@@ -304,6 +308,11 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	return NULL;
 }
 
+struct object *parse_object(struct repository *r, const struct object_id *oid)
+{
+	return parse_object_with_flags(r, oid, 0);
+}
+
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p)
 {
diff --git a/object.h b/object.h
index 9caef89f1f..31ebe11458 100644
--- a/object.h
+++ b/object.h
@@ -128,7 +128,13 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet);
  *
  * Returns NULL if the object is missing or corrupt.
  */
+enum parse_object_flags {
+	PARSE_OBJECT_SKIP_HASH_CHECK = 1 << 0,
+};
 struct object *parse_object(struct repository *r, const struct object_id *oid);
+struct object *parse_object_with_flags(struct repository *r,
+				       const struct object_id *oid,
+				       enum parse_object_flags flags);
 
 /*
  * Like parse_object, but will die() instead of returning NULL. If the
-- 
2.37.3.1134.gfd534b3986


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
  2022-09-06 23:01             ` [PATCH 1/3] parse_object(): allow skipping hash check Jeff King
@ 2022-09-06 23:05             ` Jeff King
  2022-09-07 14:36               ` Derrick Stolee
  2022-09-07 19:26               ` Junio C Hamano
  2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
  2022-09-07 14:48             ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Derrick Stolee
  3 siblings, 2 replies; 40+ messages in thread
From: Jeff King @ 2022-09-06 23:05 UTC (permalink / raw)
  To: 程洋
  Cc: Derrick Stolee, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

Imagine we have a history with commit C pointing to a large blob B.
If a client asks us for C, we can generally serve both objects to them
without accessing the uncompressed contents of B. In upload-pack, we
figure out which commits we have and what the client has, and feed those
tips to pack-objects. In pack-objects, we traverse the commits and trees
(or use bitmaps!) to find the set of objects needed, but we never open
up B. When we serve it to the client, we can often pass the compressed
bytes directly from the on-disk packfile over the wire.

But if a client asks us directly for B, perhaps because they are doing
an on-demand fetch to fill in the missing blob of a partial clone, we
end up much slower. Upload-pack calls parse_object() on the oid we
receive, which opens up the object and re-checks its hash (even though
if it were a commit, we might skip this parse entirely in favor of the
commit graph!). And then we feed the oid directly to pack-objects, which
again calls parse_object() and opens the object. And then finally, when
we write out the result, we may send bytes straight from disk, but only
after having unnecessarily uncompressed and computed the sha1 of the
object twice!

This patch teaches both code paths to use the new SKIP_HASH_CHECK flag
for parse_object(). You can see the speed-up in p5600, which does a
blob:none clone followed by a checkout. The savings for git.git are
modest:

  Test                          HEAD^             HEAD
  ----------------------------------------------------------------------
  5600.3: checkout of result    2.23(4.19+0.24)   1.72(3.79+0.18) -22.9%

But the savings scale with the number of bytes. So on a repository like
linux.git with more files, we see more improvement (in both absolute and
relative numbers):

  Test                          HEAD^                HEAD
  ----------------------------------------------------------------------------
  5600.3: checkout of result    51.62(77.26+2.76)    34.86(61.41+2.63) -32.5%

And here's an even more extreme case. This is the android gradle-plugin
repository, whose tip checkout has ~3.7GB of files:

  Test                          HEAD^               HEAD
  --------------------------------------------------------------------------
  5600.3: checkout of result    79.51(90.84+5.55)   40.28(51.88+5.67) -49.3%

Keep in mind that these timings are of the whole checkout operation. So
they count the client indexing the pack and actually writing out the
files. If we want to see just the server's view, we can hack up the
GIT_TRACE_PACKET output from those operations and replay it via
upload-pack. For the gradle example, that gives me:

  Benchmark 1: GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin <input
    Time (mean ± σ):     50.884 s ±  0.239 s    [User: 51.450 s, System: 1.726 s]
    Range (min … max):   50.608 s … 51.025 s    3 runs

  Benchmark 2: GIT_PROTOCOL=version=2 git.new upload-pack ../gradle-plugin <input
    Time (mean ± σ):      9.728 s ±  0.112 s    [User: 10.466 s, System: 1.535 s]
    Range (min … max):    9.618 s …  9.842 s    3 runs

  Summary
    'GIT_PROTOCOL=version=2 git.new upload-pack ../gradle-plugin <input' ran
      5.23 ± 0.07 times faster than 'GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin <input'

So a server would see an 80% reduction in CPU serving the initial
checkout of a partial clone for this repository. Or possibly even more
depending on the packing; most of the time spent in the faster run was
on objects we had to open during the write phase.
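(The 80% figure is just the relative drop between the two hyperfine
means quoted above; a quick arithmetic check:

```shell
# Relative CPU reduction implied by the benchmark means above.
old=50.884
new=9.728
awk -v o="$old" -v n="$new" 'BEGIN { printf "%.1f%%\n", 100 * (o - n) / o }'
# prints: 80.9%
```
)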

In both cases skipping the extra hashing on the server should be pretty
safe. The client doesn't trust the server anyway, so it will re-hash all
of the objects via index-pack. There is one thing to note, though: the
change in get_reference() affects not just pack-objects, but rev-list,
git-log, etc. We could use a flag to limit to index-pack here, but we
may already skip hash checks in this instance. For commits, we'd skip
anything we load via the commit-graph. And while before this commit we
would check a blob fed directly to rev-list on the command-line, we'd
skip checking that same blob if we found it by traversing a tree.

The exception for both is if --verify-objects is used. In that case,
we'll skip this optimization, and the new test makes sure we do this
correctly.

Signed-off-by: Jeff King <peff@peff.net>
---
 revision.c      |  4 +++-
 t/t1450-fsck.sh | 20 ++++++++++++++++++++
 upload-pack.c   |  3 ++-
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/revision.c b/revision.c
index ee702e498a..786e090785 100644
--- a/revision.c
+++ b/revision.c
@@ -384,7 +384,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 	if (commit)
 		object = &commit->object;
 	else
-		object = parse_object(revs->repo, oid);
+		object = parse_object_with_flags(revs->repo, oid,
+						 revs->verify_objects ? 0 :
+						 PARSE_OBJECT_SKIP_HASH_CHECK);
 
 	if (!object) {
 		if (revs->ignore_missing)
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index 53c2aa10b7..6410eff4e0 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -507,6 +507,26 @@ test_expect_success 'rev-list --verify-objects with bad sha1' '
 	test_i18ngrep -q "error: hash mismatch $(dirname $new)$(test_oid ff_2)" out
 '
 
+# An actual bit corruption is more likely than swapped commits, but
+# this provides an easy way to have commits which don't match their purported
+# hashes, but which aren't so broken we can't read them at all.
+test_expect_success 'rev-list --verify-objects notices swapped commits' '
+	git init swapped-commits &&
+	(
+		cd swapped-commits &&
+		test_commit one &&
+		test_commit two &&
+		one_oid=$(git rev-parse HEAD) &&
+		two_oid=$(git rev-parse HEAD^) &&
+		one=.git/objects/$(test_oid_to_path $one_oid) &&
+		two=.git/objects/$(test_oid_to_path $two_oid) &&
+		mv $one tmp &&
+		mv $two $one &&
+		mv tmp $two &&
+		test_must_fail git rev-list --verify-objects HEAD
+	)
+'
+
 test_expect_success 'force fsck to ignore double author' '
 	git cat-file commit HEAD >basis &&
 	sed "s/^author .*/&,&/" <basis | tr , \\n >multiple-authors &&
diff --git a/upload-pack.c b/upload-pack.c
index b217a1f469..4bacdf087c 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1420,7 +1420,8 @@ static int parse_want(struct packet_writer *writer, const char *line,
 		if (commit)
 			o = &commit->object;
 		else
-			o = parse_object(the_repository, &oid);
+			o = parse_object_with_flags(the_repository, &oid,
+						    PARSE_OBJECT_SKIP_HASH_CHECK);
 
 		if (!o) {
 			packet_writer_error(writer,
-- 
2.37.3.1134.gfd534b3986


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/3] parse_object(): check commit-graph when skip_hash set
  2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
  2022-09-06 23:01             ` [PATCH 1/3] parse_object(): allow skipping hash check Jeff King
  2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
@ 2022-09-06 23:06             ` Jeff King
  2022-09-07 14:46               ` Derrick Stolee
  2022-09-07 19:31               ` Junio C Hamano
  2022-09-07 14:48             ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Derrick Stolee
  3 siblings, 2 replies; 40+ messages in thread
From: Jeff King @ 2022-09-06 23:06 UTC (permalink / raw)
  To: 程洋
  Cc: Derrick Stolee, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

If the caller told us that they don't care about us checking the object
hash, then we're free to implement any optimizations that get us the
parsed value more quickly. An obvious one is to check the commit graph
before loading an object from disk. And in fact, both of the callers who
pass in this flag are already doing so before they call parse_object()!

So we can simplify those callers, as well as any possible future ones,
by moving the logic into parse_object().

There are two subtle things to note in the diff, but neither has any
impact in practice:

  - it seems least-surprising here to do the graph lookup on the
    git-replace'd oid, rather than the original. This is in theory a
    change of behavior from the earlier code, as neither caller did a
    replace lookup itself. But in practice it doesn't matter, as we
    disable the commit graph entirely if there are any replace refs.

  - the caller in get_reference() passes the skip_hash flag only if
    revs->verify_objects isn't set, whereas it would look in the commit
    graph unconditionally. In practice this should not matter as we
    should disable the commit graph entirely when using verify_objects
    (and that was done recently in another patch).

So this should be a pure cleanup with no behavior change.

Signed-off-by: Jeff King <peff@peff.net>
---
 object.c      |  6 ++++++
 revision.c    | 16 +++-------------
 upload-pack.c |  9 ++-------
 3 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/object.c b/object.c
index 8f6de09078..2e4589bae5 100644
--- a/object.c
+++ b/object.c
@@ -279,6 +279,12 @@ struct object *parse_object_with_flags(struct repository *r,
 	if (obj && obj->parsed)
 		return obj;
 
+	if (skip_hash) {
+		struct commit *commit = lookup_commit_in_graph(r, repl);
+		if (commit)
+			return &commit->object;
+	}
+
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
diff --git a/revision.c b/revision.c
index 786e090785..a04ebd6139 100644
--- a/revision.c
+++ b/revision.c
@@ -373,20 +373,10 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 				    unsigned int flags)
 {
 	struct object *object;
-	struct commit *commit;
 
-	/*
-	 * If the repository has commit graphs, we try to opportunistically
-	 * look up the object ID in those graphs. Like this, we can avoid
-	 * parsing commit data from disk.
-	 */
-	commit = lookup_commit_in_graph(revs->repo, oid);
-	if (commit)
-		object = &commit->object;
-	else
-		object = parse_object_with_flags(revs->repo, oid,
-						 revs->verify_objects ? 0 :
-						 PARSE_OBJECT_SKIP_HASH_CHECK);
+	object = parse_object_with_flags(revs->repo, oid,
+					 revs->verify_objects ? 0 :
+					 PARSE_OBJECT_SKIP_HASH_CHECK);
 
 	if (!object) {
 		if (revs->ignore_missing)
diff --git a/upload-pack.c b/upload-pack.c
index 4bacdf087c..35fe1a3dbb 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -1409,19 +1409,14 @@ static int parse_want(struct packet_writer *writer, const char *line,
 	const char *arg;
 	if (skip_prefix(line, "want ", &arg)) {
 		struct object_id oid;
-		struct commit *commit;
 		struct object *o;
 
 		if (get_oid_hex(arg, &oid))
 			die("git upload-pack: protocol error, "
 			    "expected to get oid, not '%s'", line);
 
-		commit = lookup_commit_in_graph(the_repository, &oid);
-		if (commit)
-			o = &commit->object;
-		else
-			o = parse_object_with_flags(the_repository, &oid,
-						    PARSE_OBJECT_SKIP_HASH_CHECK);
+		o = parse_object_with_flags(the_repository, &oid,
+					    PARSE_OBJECT_SKIP_HASH_CHECK);
 
 		if (!o) {
 			packet_writer_error(writer,
-- 
2.37.3.1134.gfd534b3986


* Re: [PATCH 1/3] parse_object(): allow skipping hash check
  2022-09-06 23:01             ` [PATCH 1/3] parse_object(): allow skipping hash check Jeff King
@ 2022-09-07 14:15               ` Derrick Stolee
  2022-09-07 20:44                 ` Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee @ 2022-09-07 14:15 UTC (permalink / raw)
  To: Jeff King, 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 9/6/2022 7:01 PM, Jeff King wrote:
> I'm sorry, I know the argument here is really hand-wavy. But I really
> think this isn't making anything much less safe.

I agree with you that this is the safest way to move forward here.
 
> I was actually tempted to rip out the blob hash-check entirely by
> default!  Anybody who really cares about checking the bits can do so
> with read_object_file(). That's what fsck does, and we could pretty
> easily convert "rev-list --verify-objects" to do so, too. So this is the
> less extreme version of the patch. ;)

A quick search shows many uses of parse_object() across the codebase.
It would certainly be nice if they all suddenly got faster by avoiding
this hashing, but I also suppose that most of the calls are using
parse_object() only because they are unsure if they are parsing a
commit or a tag and would never parse a large blob.

I think this approach of making parse_object_with_flags() is the best
way to incrementally approach things here. If we decide that we need
the _with_flags() version specifically to avoid this hash check, then
we could probably take the second approach: remove the hash check from
parse_object() and swap the places that care to use read_object_file()
instead. My guess is that in the long term there will be fewer swaps
to read_object_file() than to parse_object_with_flags().

However, this is a good first step to make progress without doing the
time-consuming audit of every caller to parse_object().

Thanks,
-Stolee


* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
@ 2022-09-07 14:36               ` Derrick Stolee
  2022-09-07 14:45                 ` Derrick Stolee
  2022-09-07 19:26               ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Derrick Stolee @ 2022-09-07 14:36 UTC (permalink / raw)
  To: Jeff King, 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 9/6/2022 7:05 PM, Jeff King wrote:
> This patch teaches both code paths to use the new SKIP_HASH_CHECK flag
> for parse_object(). You can see the speed-up in p5600, which does a
> blob:none clone followed by a checkout. The savings for git.git are
> modest:
> 
>   Test                          HEAD^             HEAD
>   ----------------------------------------------------------------------
>   5600.3: checkout of result    2.23(4.19+0.24)   1.72(3.79+0.18) -22.9%
> 
> But the savings scale with the number of bytes. So on a repository like
> linux.git with more files, we see more improvement (in both absolute and
> relative numbers):
> 
>   Test                          HEAD^                HEAD
>   ----------------------------------------------------------------------------
>   5600.3: checkout of result    51.62(77.26+2.76)    34.86(61.41+2.63) -32.5%
> 
> And here's an even more extreme case. This is the android gradle-plugin
> repository, whose tip checkout has ~3.7GB of files:
> 
>   Test                          HEAD^               HEAD
>   --------------------------------------------------------------------------
>   5600.3: checkout of result    79.51(90.84+5.55)   40.28(51.88+5.67) -49.3%
> 
> Keep in mind that these timings are of the whole checkout operation.

These numbers are _very_ impressive for this user-facing scenario.

>   Benchmark 1: GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin <input
>     Time (mean ± σ):     50.884 s ±  0.239 s    [User: 51.450 s, System: 1.726 s]
>     Range (min … max):   50.608 s … 51.025 s    3 runs
> 
>   Benchmark 2: GIT_PROTOCOL=version=2 git.new upload-pack ../gradle-plugin <input
>     Time (mean ± σ):      9.728 s ±  0.112 s    [User: 10.466 s, System: 1.535 s]
>     Range (min … max):    9.618 s …  9.842 s    3 runs
> 
>   Summary
>     'GIT_PROTOCOL=version=2 git.new upload-pack ../gradle-plugin <input' ran
>       5.23 ± 0.07 times faster than 'GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin <input'
> 
> So a server would see an 80% reduction in CPU serving the initial
> checkout of a partial clone for this repository.

Very nice!

> In both cases skipping the extra hashing on the server should be pretty
> safe. The client doesn't trust the server anyway, so it will re-hash all
> of the objects via index-pack.

Definitely safe for this target scenario.

> There is one thing to note, though: the
> change in get_reference() affects not just pack-objects, but rev-list,
> git-log, etc. We could use a flag to limit to index-pack here, but we
> may already skip hash checks in this instance. For commits, we'd skip
> anything we load via the commit-graph. And while before this commit we
> would check a blob fed directly to rev-list on the command-line, we'd
> skip checking that same blob if we found it by traversing a tree.

It's also worth mentioning that since you isolated the change to
get_reference(), it only skips the parse for the requested tips of the
revision walk. As we follow commits and trees to other objects, we do
not use parse_object() but instead the type-specific methods
(parse_commit() and parse_tree()), and we do not parse blobs at all.

This is additionally confirmed by p0001-rev-list.sh showing no
measurable change in results with this commit, even when disabling the
commit-graph.

> The exception for both is if --verify-objects is used. In that case,
> we'll skip this optimization, and the new test makes sure we do this
> correctly.

The test for this is clever. Thanks for adding it.

Code and test look good.

Thanks,
-Stolee


* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-07 14:36               ` Derrick Stolee
@ 2022-09-07 14:45                 ` Derrick Stolee
  2022-09-07 20:50                   ` Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: Derrick Stolee @ 2022-09-07 14:45 UTC (permalink / raw)
  To: Jeff King, 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 9/7/2022 10:36 AM, Derrick Stolee wrote:
> On 9/6/2022 7:05 PM, Jeff King wrote:

>> There is one thing to note, though: the
>> change in get_reference() affects not just pack-objects, but rev-list,
>> git-log, etc. We could use a flag to limit to index-pack here, but we
>> may already skip hash checks in this instance. For commits, we'd skip
>> anything we load via the commit-graph. And while before this commit we
>> would check a blob fed directly to rev-list on the command-line, we'd
>> skip checking that same blob if we found it by traversing a tree.
> 
> It's worth also mentioning that since you isolated the change to
> get_reference(), it is only skipping the parse for the requested tips
> of the revision walk. As we follow commits and trees to other objects
> we do not use parse_object() but instead use the appropriate method
> (parse_commit(), parse_tree(), and do not even parse blobs).

After writing this, it was bothering me that 'rev-list --verify-objects'
should be checking for corruption throughout the history, not just at
the tips.

This is done via the condition in builtin/rev-list.c:finish_object():

static int finish_object(struct object *obj, const char *name, void *cb_data)
{
	struct rev_list_info *info = cb_data;
	if (oid_object_info_extended(the_repository, &obj->oid, NULL, 0) < 0) {
		finish_object__ma(obj);
		return 1;
	}
	if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
		parse_object(the_repository, &obj->oid);
	return 0;
}

So this call is the one that is used most-often by the rev-list command,
but isn't necessary to update for the purpose of our desired speedup. This
is another place where we would want to use read_object_file(). (It may even
be the _only_ place, since finish_object() should be called even for the
tip objects.)

In case you think it is valuable to ensure we test both cases ("tip" and
"not tip") I modified your test to have a third commit and test two rev-list
calls:

# An actual bit corruption is more likely than swapped commits, but
# this provides an easy way to have commits which don't match their purported
# hashes, but which aren't so broken we can't read them at all.
test_expect_success 'rev-list --verify-objects notices swapped commits' '
	git init swapped-commits &&
	(
		cd swapped-commits &&
		test_commit one &&
		test_commit two &&
		test_commit three &&
		one_oid=$(git rev-parse HEAD) &&
		two_oid=$(git rev-parse HEAD^) &&
		one=.git/objects/$(test_oid_to_path $one_oid) &&
		two=.git/objects/$(test_oid_to_path $two_oid) &&
		mv $one tmp &&
		mv $two $one &&
		mv tmp $two &&
		test_must_fail git rev-list --verify-objects HEAD &&
		test_must_fail git rev-list --verify-objects HEAD^
	)
'

But this is likely overkill.
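(For readers following along: test_oid_to_path in the test above splits
an object id into git's loose-object layout, with the first two hex
digits as a fan-out directory and the rest as the filename. A rough
standalone equivalent, using plain POSIX tools rather than the actual
test-lib helper:

```shell
# Sketch of the split test_oid_to_path performs: a loose object lives at
# .git/objects/<first 2 hex chars>/<remaining hex chars of the oid>.
# (Standalone illustration, not the real test-lib-functions.sh helper.)
oid_to_path () {
	printf '%s/%s\n' "$(printf '%s' "$1" | cut -c1-2)" \
			 "$(printf '%s' "$1" | cut -c3-)"
}

oid_to_path d95f3ad14dee633a758d2e331151e950dd13e4ed
# prints: d9/5f3ad14dee633a758d2e331151e950dd13e4ed
```

Swapping the two commits is then just renaming the two resulting files.)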

Thanks,
-Stolee


* Re: [PATCH 3/3] parse_object(): check commit-graph when skip_hash set
  2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
@ 2022-09-07 14:46               ` Derrick Stolee
  2022-09-07 19:31               ` Junio C Hamano
  1 sibling, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2022-09-07 14:46 UTC (permalink / raw)
  To: Jeff King, 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 9/6/2022 7:06 PM, Jeff King wrote:
> If the caller told us that they don't care about us checking the object
> hash, then we're free to implement any optimizations that get us the
> parsed value more quickly. An obvious one is to check the commit graph
> before loading an object from disk. And in fact, both of the callers who
> pass in this flag are already doing so before they call parse_object()!
> 
> So we can simplify those callers, as well as any possible future ones,
> by moving the logic into parse_object().

Nice!
 
> There are two subtle things to note in the diff, but neither has any
> impact in practice:
> 
>   - it seems least-surprising here to do the graph lookup on the
>     git-replace'd oid, rather than the original. This is in theory a
>     change of behavior from the earlier code, as neither caller did a
>     replace lookup itself. But in practice it doesn't matter, as we
>     disable the commit graph entirely if there are any replace refs.

I can confirm that this is the case.

>   - the caller in get_reference() passes the skip_hash flag only if
>     revs->verify_objects isn't set, whereas it would look in the commit
>     graph unconditionally. In practice this should not matter as we
>     should disable the commit graph entirely when using verify_objects
>     (and that was done recently in another patch).
> 
> So this should be a pure cleanup with no behavior change.

Excellent, thanks!

-Stolee



* Re: [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone
  2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
                               ` (2 preceding siblings ...)
  2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
@ 2022-09-07 14:48             ` Derrick Stolee
  3 siblings, 0 replies; 40+ messages in thread
From: Derrick Stolee @ 2022-09-07 14:48 UTC (permalink / raw)
  To: Jeff King, 程洋
  Cc: git@vger.kernel.org, 何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On 9/6/2022 6:58 PM, Jeff King wrote:
> On Tue, Sep 06, 2022 at 02:38:41PM -0400, Jeff King wrote:
> 
>> On Mon, Sep 05, 2022 at 11:17:21AM +0000, 程洋 wrote:
>>
>>> Sorry, I told you the wrong branch. It should be "android-t-preview-1"
>>> git clone --filter=blob:none --no-local -b android-t-preview-1 grade-plugin
>>>
>>> Can you try this one?
>>
>> Yes, I see more slow-down there. There are many more blobs there, but I
>> don't think it's really the number of them, but their sizes.
>>
>> The problem is that both upload-pack and pack-objects are keen to call
>> parse_object() on their inputs. For commits, etc, that is usually
>> sensible; we have to parse the object to see what it points to. But for
>> blobs, the only thing we do is inflate a ton of bytes in order to check
>> the sha1. That's not really productive here; if there is a bit
>> corruption, the client will notice it on the receiving side.

Thanks for finding this very subtle issue!
 
> So here's a cleaned-up series which makes this a lot faster.
> 
> The special sauce is in patch 2, along with timings. The first one is
> just preparing, and the final one is a small cleanup it enables.

I carefully read these patches as well as applied them on my machine
and did some extra digging and performance tests to understand the
change.

LGTM.

Thanks,
-Stolee


* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
  2022-09-07 14:36               ` Derrick Stolee
@ 2022-09-07 19:26               ` Junio C Hamano
  2022-09-07 20:36                 ` Jeff King
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2022-09-07 19:26 UTC (permalink / raw)
  To: Jeff King
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Jeff King <peff@peff.net> writes:

> The exception for both is if --verify-objects is used. In that case,
> we'll skip this optimization, and the new test makes sure we do this
> correctly.

I wondered if we want to test the change on the "upload-pack" side
by going into the swapped-commits repository, running upload-pack
manually, and seeing that it spews unusable output without failing,
but it probably is not worth the effort.  We have plenty of tests
that exercise upload-pack in "good" cases.  What might be a good
test is to try fetching from the swapped-commits repository and make
sure that index-pack on the receiving end notices, but I suspect we
already have such a "fetch/clone from a corrupt repository" test,
in which case we do not have to add one.

Thanks.




* Re: [PATCH 3/3] parse_object(): check commit-graph when skip_hash set
  2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
  2022-09-07 14:46               ` Derrick Stolee
@ 2022-09-07 19:31               ` Junio C Hamano
  2022-09-08 10:39                 ` [External Mail]Re: " 程洋
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2022-09-07 19:31 UTC (permalink / raw)
  To: Jeff King
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Jeff King <peff@peff.net> writes:

> If the caller told us that they don't care about us checking the object
> hash, then we're free to implement any optimizations that get us the
> parsed value more quickly. An obvious one is to check the commit graph
> before loading an object from disk. And in fact, both of the callers who
> pass in this flag are already doing so before they call parse_object()!
>
> So we can simplify those callers, as well as any possible future ones,
> by moving the logic into parse_object().

Nicely done.


* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-07 19:26               ` Junio C Hamano
@ 2022-09-07 20:36                 ` Jeff King
  2022-09-07 20:48                   ` [BUG] t1800: Fails for error text comparison rsbecker
  2022-09-07 21:02                   ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
  0 siblings, 2 replies; 40+ messages in thread
From: Jeff King @ 2022-09-07 20:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On Wed, Sep 07, 2022 at 12:26:41PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > The exception for both is if --verify-objects is used. In that case,
> > we'll skip this optimization, and the new test makes sure we do this
> > correctly.
> 
> I wondered if we want to test the change on the "upload-pack" side
> by going into the swapped-commits repository, running upload-pack
> manually, and seeing that it spews unusable output without failing,
> but it probably is not worth the effort.  We have plenty of tests
> that exercise upload-pack in "good" cases.  What might be a good
> test is to try fetching from the swapped-commits repository and make
> sure that index-pack on the receiving end notices, but I suspect we
> already have such a "fetch/clone from a corrupt repository" test,
> in which case we do not have to add one.

There are some tests in t1060 that cover transfer of corrupted objects.
But one of the tricky things about corruption is that it can be somewhat
arbitrary where and when we notice. I.e., whether upload-pack notices
the problem and bails or whether it spews an invalid result, we're OK
with either outcome, and a test for it can end up rather fragile.

What we do care about is that _somebody_ notices the problem. t1060
covers that for some cases, though of course there's no partial clone
there. If we add one like this:

diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
index 5b8e47e346..fd18a8b29e 100755
--- a/t/t1060-object-corruption.sh
+++ b/t/t1060-object-corruption.sh
@@ -139,4 +139,11 @@ test_expect_success 'internal tree objects are not "missing"' '
 	)
 '
 
+test_expect_success 'partial clone of corrupted repository' '
+	test_config -C bit-error uploadpack.allowFilter true &&
+	git clone --no-local --no-checkout --filter=blob:none \
+		bit-error corrupt-partial && \
+	test_must_fail git -C corrupt-partial checkout --force
+'
+
 test_done

we can see that the initial blob:none clone is OK (since neither side
looks at the corrupted blob at all). And then the checkout does barf as
expected. But it looks like we catch the error in upload-pack, probably
because the loose object is so corrupted that we cannot even access its
type header. I tried moving the corruption further into the file, but I
think it's simply so small that zlib will read the whole input and
complain about the bogus crc.

Hmm. Looks like that script has another corrupted repo where an object
is misnamed. So if we s/bit-error/misnamed/ in the test above, then we
do trigger the new code. Before we get:

  error: hash mismatch d95f3ad14dee633a758d2e331151e950dd13e4ed
  fatal: git upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed
  fatal: remote error: upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed

(the error is a bit misleading, but I guess the inability to create the
object struct means our "is it our ref" code gets confused). After my
patches, we get:

  remote: Enumerating objects: 1, done.
  remote: Counting objects: 100% (1/1), done.
  remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
  Receiving objects: 100% (1/1), 49 bytes | 49.00 KiB/s, done.
  fatal: bad revision 'd95f3ad14dee633a758d2e331151e950dd13e4ed'
  error: [...]/misnamed did not send all necessary objects

Is that worth having? I dunno. It's kind of brittle in that a later
change could mean we're finding the corruption elsewhere, and not
checking exactly what we think we are. OTOH, it probably doesn't hurt to
cover more cases here, and it's not a very expensive test.

-Peff


* Re: [PATCH 1/3] parse_object(): allow skipping hash check
  2022-09-07 14:15               ` Derrick Stolee
@ 2022-09-07 20:44                 ` Jeff King
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff King @ 2022-09-07 20:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: 程洋, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

On Wed, Sep 07, 2022 at 10:15:37AM -0400, Derrick Stolee wrote:

> A quick search shows many uses of parse_object() across the codebase.
> It would certainly be nice if they all suddenly got faster by avoiding
> this hashing, but I also suppose that most of the calls are using
> parse_object() only because they are unsure if they are parsing a
> commit or a tag and would never parse a large blob.

Yeah. I think we use parse_object() as a catch-all for "somebody gave us
an oid, and we need to know what it is". I suspect that most normal uses
would not get much faster, because we typically do feed it non-blob
objects that are small, and whose contents we need to access anyway to
parse them. So we're paying only the overhead of sha1 on a buffer we
already have in memory.

In cases where we might not need the parsed contents at all, our best
bet is to actually remove or delay the parsing entirely. E.g., I think
upload-pack used to be aggressive about parsing the ref tips that it
advertised, but really we can just tell the client about them and only
parse the ones they ask for.

> I think this approach of making parse_object_with_flags() is the best
> way to incrementally approach things here. If we decide that we need
> the _with_flags() version specifically to avoid this hash check, then
> we could probably take the second approach: remove the hash check from
> parse_object() and swap the places that care to use read_object_file()
> instead. My guess is that in the long term there will be fewer swaps
> to read_object_file() than to parse_object_with_flags().
> 
> However, this is a good first step to make progress without doing the
> time-consuming audit of every caller to parse_object().

The notion that this hash check in parse_object() might be slow is
certainly not new. I've been thinking about it for at least a decade. ;)
But until this recent case of direct-fetching blobs, I hadn't seen an
instance where it really made a significant and measurable difference.

So I'm definitely not opposed to going to a world where we drop the
extra hash checks entirely, if that buys us something. The
incrementalism is conservative, but it also makes it easy to
convert specific call-sites and measure the outcomes.

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [BUG] t1800: Fails for error text comparison
  2022-09-07 20:36                 ` Jeff King
@ 2022-09-07 20:48                   ` rsbecker
  2022-09-07 21:55                     ` Junio C Hamano
  2022-09-07 21:02                   ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
  1 sibling, 1 reply; 40+ messages in thread
From: rsbecker @ 2022-09-07 20:48 UTC (permalink / raw)
  To: git

I am finding an issue with t1800.16 failing as a result of a text compare:

-fatal: cannot run bad-hooks/test-hook: ...
+fatal: cannot exec 'bad-hooks/test-hook': Permission denied

I don't think this is actually a failure condition, but the message text is platform- and shell-specific.

Is there any clean way around this?

Thanks,
Randall
--
Brief whoami: NonStop&UNIX developer since approximately
UNIX(421664400)
NonStop(211288444200000000)
-- In real life, I talk too much.




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-07 14:45                 ` Derrick Stolee
@ 2022-09-07 20:50                   ` Jeff King
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff King @ 2022-09-07 20:50 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: 程洋, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

On Wed, Sep 07, 2022 at 10:45:39AM -0400, Derrick Stolee wrote:

> After writing this, it was bothering me that 'rev-list --verify-objects' should
> be checking for corruption throughout the history, not just at the tips.
> 
> This is done via the condition in builtin/rev-list.c:finish_object():
> 
> static int finish_object(struct object *obj, const char *name, void *cb_data)
> {
> 	struct rev_list_info *info = cb_data;
> 	if (oid_object_info_extended(the_repository, &obj->oid, NULL, 0) < 0) {
> 		finish_object__ma(obj);
> 		return 1;
> 	}
> 	if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
> 		parse_object(the_repository, &obj->oid);
> 	return 0;
> }
> 
> So this call is the one that is used most-often by the rev-list command,
> but isn't necessary to update for the purpose of our desired speedup. This
> is another place where we would want to use read_object_file(). (It may even
> be the _only_ place, since finish_object() should be called even for the
> tip objects.)

Yeah, I think this is the only place. At least, if you remove the hash
check entirely, then this parse_object() is the only spot that causes a
problem (via the existing test in t1450).

> In case you think it is valuable to ensure we test both cases ("tip" and
> "not tip") I modified your test to have a third commit and test two rev-list
> calls:

I don't think it's important to test both cases here, as we know we just
touched the tip code. But also, the existing "rev-list --verify-objects"
test in t1450 is covering that case: it corrupts a blob that is
reachable from an otherwise good commit/tree.

We should also have tip and non-tip tests for commits, but I posted
those in a separate series. ;)

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-07 20:36                 ` Jeff King
  2022-09-07 20:48                   ` [BUG] t1800: Fails for error text comparison rsbecker
@ 2022-09-07 21:02                   ` Jeff King
  2022-09-07 22:07                     ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Jeff King @ 2022-09-07 21:02 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On Wed, Sep 07, 2022 at 04:36:22PM -0400, Jeff King wrote:

> Is that worth having? I dunno. It's kind of brittle in that a later
> change could mean we're finding the corruption elsewhere, and not
> checking exactly what we think we are. OTOH, it probably doesn't hurt to
> cover more cases here, and it's not a very expensive test.

So here's what it would look like as a patch on top.

It could also be squashed in, of course, which saves the vague reference
to "a recent commit...". OTOH, the testing is sufficiently complicated to
discuss that it's nice not to overwhelm the already-long commit
message of the original. ;)

-- >8 --
Subject: [PATCH] t1060: check partial clone of misnamed blob

A recent commit (upload-pack: skip parse-object re-hashing of "want"
objects, 2022-09-06) loosened the behavior of upload-pack so that it
does not verify the sha1 of objects it receives directly via "want"
requests. The existing corruption tests in t1060 aren't affected by
this: the corruptions are blobs reachable from commits, and the client
requests the commits.

The more interesting case here is a partial clone, where the client will
directly ask for the corrupted blob when it does an on-demand fetch of
the filtered object. And that is not covered at all, so let's add a
test.

It's important here that we use the "misnamed" corruption and not
"bit-error". The latter is sufficiently corrupted that upload-pack
cannot even figure out the type of the object, so it bails identically
both before and after the recent change. But with "misnamed", with the
hash-checks enabled it sees the problem (though the error messages are a
bit confusing because of the inability to create a "struct object" to
store the flags):

  error: hash mismatch d95f3ad14dee633a758d2e331151e950dd13e4ed
  fatal: git upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed
  fatal: remote error: upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed

After the change to skip the hash check, the server side happily sends
the bogus object, but the client correctly realizes that it did not get
the necessary data:

  remote: Enumerating objects: 1, done.
  remote: Counting objects: 100% (1/1), done.
  remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
  Receiving objects: 100% (1/1), 49 bytes | 49.00 KiB/s, done.
  fatal: bad revision 'd95f3ad14dee633a758d2e331151e950dd13e4ed'
  error: [...]/misnamed did not send all necessary objects

which is exactly what we expect to happen.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t1060-object-corruption.sh | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
index 5b8e47e346..35261afc9d 100755
--- a/t/t1060-object-corruption.sh
+++ b/t/t1060-object-corruption.sh
@@ -139,4 +139,11 @@ test_expect_success 'internal tree objects are not "missing"' '
 	)
 '
 
+test_expect_success 'partial clone of corrupted repository' '
+	test_config -C misnamed uploadpack.allowFilter true &&
+	git clone --no-local --no-checkout --filter=blob:none \
+		misnamed corrupt-partial && \
+	test_must_fail git -C corrupt-partial checkout --force
+'
+
 test_done
-- 
2.37.3.1138.gedcf976b26


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [BUG] t1800: Fails for error text comparison
  2022-09-07 20:48                   ` [BUG] t1800: Fails for error text comparison rsbecker
@ 2022-09-07 21:55                     ` Junio C Hamano
  2022-09-07 22:23                       ` rsbecker
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2022-09-07 21:55 UTC (permalink / raw)
  To: rsbecker; +Cc: git

<rsbecker@nexbridge.com> writes:

> I am finding an issue with t1800.16 failing as a result of a text compare:
>
> -fatal: cannot run bad-hooks/test-hook: ...
> +fatal: cannot exec 'bad-hooks/test-hook': Permission denied
>
> I don't think this is actually a failure condition but the message
> text is platform and shell specific.

Isn't this coming from this piece of code in start_command()?

		/*
		 * Attempt to exec using the command and arguments starting at
		 * argv.argv[1].  argv.argv[0] contains SHELL_PATH which will
		 * be used in the event exec failed with ENOEXEC at which point
		 * we will try to interpret the command using 'sh'.
		 */
		execve(argv.v[1], (char *const *) argv.v + 1,
		       (char *const *) childenv);
		if (errno == ENOEXEC)
			execve(argv.v[0], (char *const *) argv.v,
			       (char *const *) childenv);

		if (errno == ENOENT) {
			if (cmd->silent_exec_failure)
				child_die(CHILD_ERR_SILENT);
			child_die(CHILD_ERR_ENOENT);
		} else {
			child_die(CHILD_ERR_ERRNO);
		}

The test apparently expects CHILD_ERR_ENOENT, which comes from
child_err_spew():

	case CHILD_ERR_ENOENT:
		error_errno("cannot run %s", cmd->args.v[0]);
		break;
	case CHILD_ERR_SILENT:
		break;
	case CHILD_ERR_ERRNO:
		error_errno("cannot exec '%s'", cmd->args.v[0]);
		break;
	}

but somehow your system fails the execve() with something other than
ENOENT.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-07 21:02                   ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
@ 2022-09-07 22:07                     ` Junio C Hamano
  2022-09-08  5:04                       ` Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2022-09-07 22:07 UTC (permalink / raw)
  To: Jeff King
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Jeff King <peff@peff.net> writes:

> Subject: [PATCH] t1060: check partial clone of misnamed blob
>
> A recent commit (upload-pack: skip parse-object re-hashing of "want"
> objects, 2022-09-06) loosened the behavior of upload-pack so that it
> does not verify the sha1 of objects it receives directly via "want"
> requests. The existing corruption tests in t1060 aren't affected by
> this: the corruptions are blobs reachable from commits, and the client
> requests the commits.
>
> The more interesting case here is a partial clone, where the client will
> directly ask for the corrupted blob when it does an on-demand fetch of
> the filtered object. And that is not covered at all, so let's add a
> test.
>
> It's important here that we use the "misnamed" corruption and not
> "bit-error". The latter is sufficiently corrupted that upload-pack
> cannot even figure out the type of the object, so it bails identically
> both before and after the recent change. But with "misnamed", with the
> hash-checks enabled it sees the problem (though the error messages are a
> bit confusing because of the inability to create a "struct object" to
> store the flags):

Makes sense.

>   error: hash mismatch d95f3ad14dee633a758d2e331151e950dd13e4ed
>   fatal: git upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed
>   fatal: remote error: upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed
>
> After the change to skip the hash check, the server side happily sends
> the bogus object, but the client correctly realizes that it did not get
> the necessary data:
>
>   remote: Enumerating objects: 1, done.
>   remote: Counting objects: 100% (1/1), done.
>   remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
>   Receiving objects: 100% (1/1), 49 bytes | 49.00 KiB/s, done.
>   fatal: bad revision 'd95f3ad14dee633a758d2e331151e950dd13e4ed'
>   error: [...]/misnamed did not send all necessary objects
>
> which is exactly what we expect to happen.

Thanks.  Let's queue it on top.

> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  t/t1060-object-corruption.sh | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh
> index 5b8e47e346..35261afc9d 100755
> --- a/t/t1060-object-corruption.sh
> +++ b/t/t1060-object-corruption.sh
> @@ -139,4 +139,11 @@ test_expect_success 'internal tree objects are not "missing"' '
>  	)
>  '
>  
> +test_expect_success 'partial clone of corrupted repository' '
> +	test_config -C misnamed uploadpack.allowFilter true &&
> +	git clone --no-local --no-checkout --filter=blob:none \
> +		misnamed corrupt-partial && \
> +	test_must_fail git -C corrupt-partial checkout --force
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [BUG] t1800: Fails for error text comparison
  2022-09-07 21:55                     ` Junio C Hamano
@ 2022-09-07 22:23                       ` rsbecker
  0 siblings, 0 replies; 40+ messages in thread
From: rsbecker @ 2022-09-07 22:23 UTC (permalink / raw)
  To: 'Junio C Hamano'; +Cc: git

On September 7, 2022 5:56 PM, Junio C Hamano wrote:
><rsbecker@nexbridge.com> writes:
>
>> I am finding an issue with t1800.16 failing as a result of a text compare:
>>
>> -fatal: cannot run bad-hooks/test-hook: ...
>> +fatal: cannot exec 'bad-hooks/test-hook': Permission denied
>>
>> I don't think this is actually a failure condition but the message
>> text is platform and shell specific.
>
>Isn't this coming from piece of code in start_command()?
>
>		/*
>		 * Attempt to exec using the command and arguments starting at
>		 * argv.argv[1].  argv.argv[0] contains SHELL_PATH which will
>		 * be used in the event exec failed with ENOEXEC at which point
>		 * we will try to interpret the command using 'sh'.
>		 */
>		execve(argv.v[1], (char *const *) argv.v + 1,
>		       (char *const *) childenv);
>		if (errno == ENOEXEC)
>			execve(argv.v[0], (char *const *) argv.v,
>			       (char *const *) childenv);
>
>		if (errno == ENOENT) {
>			if (cmd->silent_exec_failure)
>				child_die(CHILD_ERR_SILENT);
>			child_die(CHILD_ERR_ENOENT);
>		} else {
>			child_die(CHILD_ERR_ERRNO);
>		}
>
>The test apparently expects CHILD_ERR_ENOENT, which comes from
>child_err_spew():
>
>	case CHILD_ERR_ENOENT:
>		error_errno("cannot run %s", cmd->args.v[0]);
>		break;
>	case CHILD_ERR_SILENT:
>		break;
>	case CHILD_ERR_ERRNO:
>		error_errno("cannot exec '%s'", cmd->args.v[0]);
>		break;
>	}
>
>but somehow your system fails the execve() with something other than ENOENT.

The file exists but could not be executed. EPERM was returned. The reason
is that test-hook exists and contains:

#!/bad/path/no/spaces

which does not exist. This (correctly, IMO) translates to EPERM rather than
ENOENT because the test-hook file itself exists but cannot be executed due
to the underlying failure. The underlying ENOENT is hidden.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-07 22:07                     ` Junio C Hamano
@ 2022-09-08  5:04                       ` Jeff King
  2022-09-08 16:41                         ` Junio C Hamano
  0 siblings, 1 reply; 40+ messages in thread
From: Jeff King @ 2022-09-08  5:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On Wed, Sep 07, 2022 at 03:07:22PM -0700, Junio C Hamano wrote:

> > Subject: [PATCH] t1060: check partial clone of misnamed blob
> [...]
> Thanks.  Let's queue it on top.

<sigh> This fails the leak-check CI job because of existing leaks in
"clone --filter". I think I found the (or perhaps "a") bottom of that
rabbit hole, though:

  https://lore.kernel.org/git/Yxl1BNQoy6Drf0Oe@coredump.intra.peff.net/

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [External Mail]Re: [PATCH 3/3] parse_object(): check commit-graph when skip_hash set
  2022-09-07 19:31               ` Junio C Hamano
@ 2022-09-08 10:39                 ` 程洋
  2022-09-08 18:42                   ` Jeff King
  0 siblings, 1 reply; 40+ messages in thread
From: 程洋 @ 2022-09-08 10:39 UTC (permalink / raw)
  To: Junio C Hamano, Jeff King
  Cc: Derrick Stolee, git@vger.kernel.org, 何浩,
	Xin7 Ma 马鑫, 石奉兵,
	凡军辉, 王汉基

> If the caller told us that they don't care about us checking the
> object hash, then we're free to implement any optimizations that get
> us the parsed value more quickly. An obvious one is to check the
> commit graph before loading an object from disk. And in fact, both of
> the callers who pass in this flag are already doing so before they call
> parse_object()!

> So we can simplify those callers, as well as any possible future ones,
> by moving the logic into parse_object().

I need to mention that there is a serious issue with the commit-graph in combination with partial clone:
https://lore.kernel.org/git/20220709005227.82423-1-hanxin.hx@bytedance.com/

So for now I have actually told my users to disable the commit-graph manually.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects
  2022-09-08  5:04                       ` Jeff King
@ 2022-09-08 16:41                         ` Junio C Hamano
  0 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2022-09-08 16:41 UTC (permalink / raw)
  To: Jeff King
  Cc: 程洋, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

Jeff King <peff@peff.net> writes:

> On Wed, Sep 07, 2022 at 03:07:22PM -0700, Junio C Hamano wrote:
>
>> > Subject: [PATCH] t1060: check partial clone of misnamed blob
>> [...]
>> Thanks.  Let's queue it on top.
>
> <sigh> This fails the leak-check CI job because of existing leaks in
> "clone --filter". I think I found the (or perhaps "a") bottom of that
> rabbit hole, though:
>
>   https://lore.kernel.org/git/Yxl1BNQoy6Drf0Oe@coredump.intra.peff.net/

Thanks.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [External Mail]Re: [PATCH 3/3] parse_object(): check commit-graph when skip_hash set
  2022-09-08 10:39                 ` [External Mail]Re: " 程洋
@ 2022-09-08 18:42                   ` Jeff King
  0 siblings, 0 replies; 40+ messages in thread
From: Jeff King @ 2022-09-08 18:42 UTC (permalink / raw)
  To: 程洋
  Cc: Junio C Hamano, Derrick Stolee, git@vger.kernel.org,
	何浩, Xin7 Ma 马鑫,
	石奉兵, 凡军辉,
	王汉基

On Thu, Sep 08, 2022 at 10:39:08AM +0000, 程洋 wrote:

> > If the caller told us that they don't care about us checking the
> > object hash, then we're free to implement any optimizations that get
> > us the parsed value more quickly. An obvious one is to check the
> > commit graph before loading an object from disk. And in fact, both of
> > the callers who pass in this flag are already doing so before they call
> > parse_object()!
> 
> > So we can simplify those callers, as well as any possible future ones,
> > by moving the logic into parse_object().
> 
> I need to mention that there is serious issue with commit-graph together with partial-clone
> https://lore.kernel.org/git/20220709005227.82423-1-hanxin.hx@bytedance.com/

Yes, though I don't think that changes anything with respect to my
patches. The argument here is simply that if a caller is OK skipping the
hash check, they are OK with using the commit graph, and vice versa.
Until the bug in the linked thread is fixed, it is a good idea to
disable the commit graph. But it does not change the intent of the
calling code.

-Peff

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-09-08 18:42 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-11  8:09 Partial-clone cause big performance impact on server 程洋
2022-08-11 17:22 ` Jonathan Tan
2022-08-13  7:55   ` 回复: [External Mail]Re: " 程洋
2022-08-13 11:41     ` 程洋
2022-08-15  5:16     ` ZheNing Hu
2022-08-15 13:15       ` 程洋
2022-08-12 12:21 ` Derrick Stolee
2022-08-14  6:48 ` Jeff King
2022-08-15 13:18   ` Derrick Stolee
2022-08-15 14:50     ` [External Mail]Re: " 程洋
2022-08-17 10:22     ` 程洋
2022-08-17 13:41       ` Derrick Stolee
2022-08-18  5:49         ` Jeff King
2022-09-01  6:53   ` 程洋
2022-09-01 16:19     ` Jeff King
2022-09-05 11:17       ` 程洋
2022-09-06 18:38         ` Jeff King
2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
2022-09-06 23:01             ` [PATCH 1/3] parse_object(): allow skipping hash check Jeff King
2022-09-07 14:15               ` Derrick Stolee
2022-09-07 20:44                 ` Jeff King
2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
2022-09-07 14:36               ` Derrick Stolee
2022-09-07 14:45                 ` Derrick Stolee
2022-09-07 20:50                   ` Jeff King
2022-09-07 19:26               ` Junio C Hamano
2022-09-07 20:36                 ` Jeff King
2022-09-07 20:48                   ` [BUG] t1800: Fails for error text comparison rsbecker
2022-09-07 21:55                     ` Junio C Hamano
2022-09-07 22:23                       ` rsbecker
2022-09-07 21:02                   ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
2022-09-07 22:07                     ` Junio C Hamano
2022-09-08  5:04                       ` Jeff King
2022-09-08 16:41                         ` Junio C Hamano
2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
2022-09-07 14:46               ` Derrick Stolee
2022-09-07 19:31               ` Junio C Hamano
2022-09-08 10:39                 ` [External Mail]Re: " 程洋
2022-09-08 18:42                   ` Jeff King
2022-09-07 14:48             ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
