Question relate to collaboration on git monorepo

git@vger.kernel.org mailing list mirror (one of many)

* Question relate to collaboration on git monorepo
@ 2022-09-20 12:42 ZheNing Hu
  2022-09-20 18:53 ` Emily Shaffer
  2022-09-21  1:47 ` Elijah Newren
  0 siblings, 2 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-20 12:42 UTC (permalink / raw)
  To: Git List
  Cc: Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye, Elijah Newren

Hey, guys,

Suppose two users of a git monorepo work on different subprojects,
/project1 and /project2, using partial clone and sparse-checkout.
If user one pushes first and user two then wants to push, user two must
first pull some blobs that user one pushed. I guess each user's repo
will gradually grow with objects from other projects. Is there any way
to delete blobs that fall outside the working project (the sparse-checkout
filterspec), or to make git pull not fetch these unnecessary blobs at all?

The large number of interruptions to git push may be another
problem. If thousands of projects live in one monorepo, and
no one else has any code that would conflict with mine in any way,
do I still need to pull every time? Is there a way to improve
this?

Here's an example of how two users constrain each other when pushing.

#!/bin/bash
# bash is needed for the for ((...)) loops below

rm -rf mono-repo
git init mono-repo -b main
(
  cd mono-repo
  mkdir project1
  mkdir project2
  for ((i=0;i<10;i++))
  do
    echo $i >> project1/file1
    echo $i >> project2/file2
  done
  git add .
  git commit -m init
)

rm -rf mono-repo.git
git clone --bare mono-repo

# user1
rm -rf m1
git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1
(
  cd m1
  git sparse-checkout set project1
  git checkout main
  for ((i=0;i<10;i++))
  do
    echo "data1-$i" >> project1/file1
    git add .
    git commit -m "c1 $i"
  done
)

# user2
rm -rf m2
git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m2
(
  cd m2
  git sparse-checkout set project2
  git checkout main
  for ((i=0;i<10;i++))
  do
    echo "data2-$i" >> project2/file2
    git add .
    git commit -m "c2 $i"
  done
)

# user1 push
(
  cd m1
  git push
)

# user2 push failed, then pull user1's blob
(
  cd m2
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
  git push
  git -c pull.rebase=false pull --no-edit #no conflict
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
  diff blob_count1 blob_count2
)

--
ZheNing Hu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Question relate to collaboration on git monorepo
  2022-09-20 12:42 Question relate to collaboration on git monorepo ZheNing Hu
@ 2022-09-20 18:53 ` Emily Shaffer
  2022-09-21 15:22   ` ZheNing Hu
  2022-09-21  1:47 ` Elijah Newren
  1 sibling, 1 reply; 12+ messages in thread
From: Emily Shaffer @ 2022-09-20 18:53 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye, Elijah Newren

On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Hey, guys,
>
> Suppose two users of a git monorepo work on different subprojects,
> /project1 and /project2, using partial clone and sparse-checkout.
> If user one pushes first and user two then wants to push, user two must
> first pull some blobs that user one pushed. I guess each user's repo
> will gradually grow with objects from other projects. Is there any way
> to delete blobs that fall outside the working project (the sparse-checkout
> filterspec), or to make git pull not fetch these unnecessary blobs at all?

This is exactly what the combination of partial clone and sparse
checkout is for!

Dev A is working on project1/, and excludes project2/ from her sparse
filter; she also cloned with `--filter=blob:none`.
Dev B is working on project2/, and excludes project1/ from his sparse
filter, and similarly is using the blob:none partial clone filter.

Assuming everybody is contributing by direct push, and not using a
code review tool or something else which handles the push for them...
Dev A finishes first, and pushes.
Dev B needs to pull, like you say - but during that pull he doesn't
need to fetch the objects in project1, because they're excluded by the
combination of his partial clone filter and his sparse checkout
pattern. The pull needs to happen because there is a new commit which
Dev B's commit needs to treat as a parent, and so Dev B's client needs
to know the ID of that commit.

>
> The large number of interruptions to git push may be another
> problem. If thousands of projects live in one monorepo, and
> no one else has any code that would conflict with mine in any way,
> do I still need to pull every time? Is there a way to improve
> this?

The typical improvement people make here is to use some form of
automation or tooling to perform the push and merge for them. That
usually falls to the code review tool. We can describe the history like
this: "S" is the source commit which both A and B branched from, and
"A" and "B" are the commits by their respective owners. Because of the
order of push, we want the final commit history to look like "S -> A
-> B". Dev A's local history looks like "S -> A" and Dev B's local
history looks like "S -> B".

If we're using the GitHub PR model, then GitHub may do merge commits
for us, and it creates those merge commits automatically at the time
someone pushes "Merge PR" (or whatever the button is called). So our
history probably looks like:
o  (merge B)
|   \
o   |  (merge A)
|\  |
| | B
| A |
| / /
S

In this case, neither A nor B needs to know about the other, because
the merge commit is being created by the code review tool.

With tooling like Gerrit, or other tooling that uses the rebase
strategy (rather than merge), pretty much the same thing happens -
both devs can push without knowing about each other's commits because the
review tool's automation performs the rebase (that is, the "git pull"
you described) for them.

But if you're not using tooling, yeah, Dev B needs to know which
commit should come before his own commit, so he needs to fetch latest
history, even though the only changes are from Dev A who was working
somewhere else in the monorepo.


* Re: Question relate to collaboration on git monorepo
  2022-09-20 12:42 Question relate to collaboration on git monorepo ZheNing Hu
  2022-09-20 18:53 ` Emily Shaffer
@ 2022-09-21  1:47 ` Elijah Newren
  2022-09-21 15:42   ` ZheNing Hu
  1 sibling, 1 reply; 12+ messages in thread
From: Elijah Newren @ 2022-09-21  1:47 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Hey, guys,
>
> Suppose two users of a git monorepo work on different subprojects,
> /project1 and /project2, using partial clone and sparse-checkout.
> If user one pushes first and user two then wants to push, user two must
> first pull some blobs that user one pushed.

This is not true.  While user two must pull the new commit and any new
trees pushed by user one (which will mean knowing the hashes of the
new files), there is no need to download the actual content of the new
files unless and until some git command is run that attempts to view
the file's contents.

> The large number of interruptions to git push may be another
> problem. If thousands of projects live in one monorepo, and
> no one else has any code that would conflict with mine in any way,
> do I still need to pull every time? Is there a way to improve
> this?

No, you only need to pull when attempting to push back to the server.

Further, if you're worried that the second push will fail, you could
easily script it and put "pull --rebase && push" in a loop until it
succeeds (I mean, you did say no one would have any conflicts).  In
fact, you could just make that a common script distributed to your
users and tell them to run that instead of "git push" if they don't
want to worry about manually updating.
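
A minimal sketch of such a wrapper, assuming a POSIX shell. The function
name push_with_retry and the attempt limit are hypothetical, not something
git ships. It gives up as soon as the rebase itself fails, so a real
conflict does not loop forever:

```shell
# push_with_retry [max_attempts]  (hypothetical helper, not part of git)
# Repeats "pull --rebase, then push" until the push is accepted or the
# attempt limit is reached.
push_with_retry() {
  max_attempts=${1:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]
  do
    # A failed rebase (for example, a genuine conflict) needs a human.
    git pull --rebase || return 1
    # The push is rejected again only if someone else pushed in between.
    if git push
    then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "push still rejected after $max_attempts attempts" >&2
  return 1
}
```

Users would source this and run "push_with_retry 10" from inside their
clone instead of "git push".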

Now, if you have thousands of nearly fully independent subprojects and
lots of developers for each subproject and they all commit & push
*very* frequently, I guess you might be able to eventually get to the
scale where you are worried there will be so much contention that the
script will take too long.  I'd be surprised if you got that far, but
even if you did, you could easily adopt a lieutenant-like workflow
(somewhat like the linux kernel, but even simpler given the
independence of your projects).  In such a workflow, you'd let people
in subprojects push to their subproject fork (instead of to the "main"
or "central" repository), and the lieutenants of the subprojects then
periodically push work from that subproject to the main project in
batches.

I don't really see much need here for improvements, myself.

> Here's an example of how two users constrain each other when git push.

Did you pay attention to warnings you got along the way?  In particular...

> git clone --bare mono-repo

You missed the following command right after your clone:

   git -C mono-repo.git config uploadpack.allowFilter true

> # user1
> rm -rf m1
> git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1

Since you forgot to set the important config I mentioned above, your
command here generates the following line of output, among others:

    warning: filtering not recognized by server, ignoring

This warning means you weren't testing partial clones, but regular
full clones.  Perhaps that was the cause of your confusion?
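
One way to check whether a clone really is partial is to count the objects
that are reachable but not present locally. A rough sketch, assuming a
POSIX shell; count_missing_objects is a made-up helper name, and it relies
on "git rev-list --missing=print" marking absent objects with a leading
"?". A full clone should report 0, while a blob:none clone made with
--no-checkout should report a nonzero number:

```shell
# count_missing_objects [repo]  (hypothetical helper, not part of git)
# Prints how many objects reachable from any ref are absent from the
# local object store of the given repository (default: current directory).
count_missing_objects() {
  repo=${1:-.}
  # rev-list prints each missing object as "?<object-id>".
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  git -C "$repo" rev-list --objects --all --missing=print |
    grep -c '^?' || true
}
```

For the test script above, "count_missing_objects m1" run right after the
clone would show whether the filter was honored.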

* Re: Question relate to collaboration on git monorepo
  2022-09-20 18:53 ` Emily Shaffer
@ 2022-09-21 15:22   ` ZheNing Hu
  2022-09-21 23:36     ` Elijah Newren
  0 siblings, 1 reply; 12+ messages in thread
From: ZheNing Hu @ 2022-09-21 15:22 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Git List, Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye, Elijah Newren

Emily Shaffer <emilyshaffer@google.com> wrote on Wed, Sep 21, 2022 at 02:53:
>
> On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Hey, guys,
> >
> > Suppose two users of a git monorepo work on different subprojects,
> > /project1 and /project2, using partial clone and sparse-checkout.
> > If user one pushes first and user two then wants to push, user two must
> > first pull some blobs that user one pushed. I guess each user's repo
> > will gradually grow with objects from other projects. Is there any way
> > to delete blobs that fall outside the working project (the sparse-checkout
> > filterspec), or to make git pull not fetch these unnecessary blobs at all?
>
> This is exactly what the combination of partial clone and sparse
> checkout is for!
>
> Dev A is working on project1/, and excludes project2/ from her sparse
> filter; she also cloned with `--filter=blob:none`.
> Dev B is working on project2/, and excludes project1/ from his sparse
> filter, and similarly  is using blob:none partial clone filter.
>
> Assuming everybody is contributing by direct push, and not using a
> code review tool or something else which handles the push for them...
> Dev A finishes first, and pushes.
> Dev B needs to pull, like you say - but during that pull he doesn't
> need to fetch the objects in project1, because they're excluded by the
> combination of his partial clone filter and his sparse checkout
> pattern. The pull needs to happen because there is a new commit which
> Dev B's commit needs to treat as a parent, and so Dev B's client needs
> to know the ID of that commit.
>

I don't agree here: the blobs are indeed fetched during git pull. I made
a small change to the previous test:

(
  cd m2
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
#  git push
#  git -c pull.rebase=false pull --no-edit #no conflict
  git fetch origin main
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
  git merge --no-edit origin/main
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count3
  printf "blob_count1=%s\n" $(cat blob_count1)
  printf "blob_count2=%s\n" $(cat blob_count2)
  printf "blob_count3=%s\n" $(cat blob_count3)
)

warning: This repository uses promisor remotes. Some objects may not be loaded.
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 30 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (30/30), 2.61 KiB | 2.61 MiB/s, done.
From /Users/adl/./mono-repo
 * branch            main       -> FETCH_HEAD
   a6a17f2..16a8585  main       -> origin/main
warning: This repository uses promisor remotes. Some objects may not be loaded.
Merge made by the 'ort' strategy.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (1/1), 87 bytes | 87.00 KiB/s, done.
 project1/file1 | 10 ++++++++++
 1 file changed, 10 insertions(+)
warning: This repository uses promisor remotes. Some objects may not be loaded.
blob_count1=11
blob_count2=11
blob_count3=12

The result shows that the blob count doesn't change during git fetch, but
does during git merge. However, not all of this blob's history is pulled
down here, so the local repository should grow slowly. Still, I wonder
whether there is a way to periodically clean up these unneeded blobs.

> >
> > The large number of interruptions to git push may be another
> > problem. If thousands of projects live in one monorepo, and
> > no one else has any code that would conflict with mine in any way,
> > do I still need to pull every time? Is there a way to improve
> > this?
>
> The typical improvement people make here is to use some form of
> automation or tooling to perform the push and merge for them. That
> usually falls to the code review tool. We can call the history like
> this: "S" is the source commit which both A and B branched from, and
> "A" and "B" are the commits by their respective owners. Because of the
> order of push, we want the final commit history to look like "S -> A
> -> B". Dev A's local history looks like "S -> A" and Dev B's local
> history looks like "S -> B".
>
> If we're using the GitHub PR model, then GitHub may do merge commits
> for us, and it creates those merge commits automatically at the time
> someone pushes "Merge PR" (or whatever the button is called). So our
> history probably looks like:
> o  (merge B)
> |   \
> o   |  (merge A)
> |\  |
> | | B
> | A |
> | / /
> S
>
> In this case, neither A nor B needs to know about the other, because
> the merge commit is being created by the code review tool.
>
> With tooling like Gerrit, or other tooling that uses the rebase
> strategy (rather than merge), pretty much the same thing happens -
> both devs can push without knowing about each other's commits because the
> review tool's automation performs the rebase (that is, the "git pull"
> you described) for them.
>

Agreed. With GitHub PRs or Gerrit, the merge/rebase happens on the
remote server, so the local repo never runs git merge and doesn't need
to fetch the blobs.

> But if you're not using tooling, yeah, Dev B needs to know which
> commit should come before his own commit, so he needs to fetch latest
> history, even though the only changes are from Dev A who was working
> somewhere else in the monorepo.
>

Thanks for the answer,
ZheNing Hu

* Re: Question relate to collaboration on git monorepo
  2022-09-21  1:47 ` Elijah Newren
@ 2022-09-21 15:42   ` ZheNing Hu
  0 siblings, 0 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-21 15:42 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git List, Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

Elijah Newren <newren@gmail.com> wrote on Wed, Sep 21, 2022 at 09:48:
>
> On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Hey, guys,
> >
> > Suppose two users of a git monorepo work on different subprojects,
> > /project1 and /project2, using partial clone and sparse-checkout.
> > If user one pushes first and user two then wants to push, user two must
> > first pull some blobs that user one pushed.
>
> This is not true.  While user two must pull the new commit and any new
> trees pushed by user one (which will mean knowing the hashes of the
> new files), there is no need to download the actual content of the new
> files unless and until some git command is run that attempts to view
> the file's contents.
>

Yeah, now I understand that git fetch will not download blobs outside
the sparse-checkout pattern, but git merge will. So git pull will
download some missing blobs here.

> > The large number of interruptions to git push may be another
> > problem. If thousands of projects live in one monorepo, and
> > no one else has any code that would conflict with mine in any way,
> > do I still need to pull every time? Is there a way to improve
> > this?
>
> No, you only need to pull when attempting to push back to the server.
>
> Further, if you're worried that the second push will fail, you could
> easily script it and put "pull --rebase && push" in a loop until it
> succeeds (I mean, you did say no one would have any conflicts).  In
> fact, you could just make that a common script distributed to your
> users and tell them to run that instead of "git push" if they don't
> want to worry about manually updating.
>

Ah, this method looks a little funny, but it might work. The same
issue may also apply to some code review tools, which might need
a "pull --rebase && git cr" loop.

> Now, if you have thousands of nearly fully independent subprojects and
> lots of developers for each subproject and they all commit & push
> *very* frequently, I guess you might be able to eventually get to the
> scale where you are worried there will be so much contention that the
> script will take too long.  I'd be surprised if you got that far, but
> even if you did, you could easily adopt a lieutenant-like workflow
> (somewhat like the linux kernel, but even simpler given the
> independence of your projects).  In such a workflow, you'd let people
> in subprojects push to their subproject fork (instead of to the "main"
> or "central" repository), and the lieutenants of the subprojects then
> periodically push work from that subproject to the main project in
> batches.
>

Makes sense. If this mono-repo really reaches this kind of scale,
splitting the workflow might be the right thing to do.

> I don't really see much need here for improvements, myself.
>
> > Here's an example of how two users constrain each other when git push.
>
> Did you pay attention to warnings you got along the way?  In particular...
>
> > git clone --bare mono-repo
>
> You missed the following command right after your clone:
>
>    git -C mono-repo.git config uploadpack.allowFilter true
>
> > # user1
> > rm -rf m1
> > git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1
>
> Since you forgot to set the important config I mentioned above, your
> command here generates the following line of output, among others:
>
>     warning: filtering not recognized by server, ignoring
>
> This warning means you weren't testing partial clones, but regular
> full clones.  Perhaps that was the cause of your confusion?

Oh, sorry, I forgot to mention this: I have these configured globally:

uploadpack.allowanysha1inwant=true
uploadpack.allowfilter=true

Thanks for the answer,
ZheNing Hu

* Re: Question relate to collaboration on git monorepo
  2022-09-21 15:22   ` ZheNing Hu
@ 2022-09-21 23:36     ` Elijah Newren
  2022-09-22 14:24       ` Derrick Stolee
  2022-09-23 14:31       ` ZheNing Hu
  0 siblings, 2 replies; 12+ messages in thread
From: Elijah Newren @ 2022-09-21 23:36 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Emily Shaffer, Git List, Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Emily Shaffer <emilyshaffer@google.com> wrote on Wed, Sep 21, 2022 at 02:53:
> >
> > On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > >
> > > Hey, guys,
> > >
> > > Suppose two users of a git monorepo work on different subprojects,
> > > /project1 and /project2, using partial clone and sparse-checkout.
> > > If user one pushes first and user two then wants to push, user two must
> > > first pull some blobs that user one pushed. I guess each user's repo
> > > will gradually grow with objects from other projects. Is there any way
> > > to delete blobs that fall outside the working project (the sparse-checkout
> > > filterspec), or to make git pull not fetch these unnecessary blobs at all?
> >
> > This is exactly what the combination of partial clone and sparse
> > checkout is for!
> >
> > Dev A is working on project1/, and excludes project2/ from her sparse
> > filter; she also cloned with `--filter=blob:none`.
> > Dev B is working on project2/, and excludes project1/ from his sparse
> > filter, and similarly  is using blob:none partial clone filter.
> >
> > Assuming everybody is contributing by direct push, and not using a
> > code review tool or something else which handles the push for them...
> > Dev A finishes first, and pushes.
> > Dev B needs to pull, like you say - but during that pull he doesn't
> > need to fetch the objects in project1, because they're excluded by the
> > combination of his partial clone filter and his sparse checkout
> > pattern. The pull needs to happen because there is a new commit which
> > Dev B's commit needs to treat as a parent, and so Dev B's client needs
> > to know the ID of that commit.
> >
>
> I don't agree here: the blobs are indeed fetched during git pull. I made
> a small change to the previous test:
>
> (
>   cd m2
>   git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
> #  git push
> #  git -c pull.rebase=false pull --no-edit #no conflict
>   git fetch origin main
>   git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
>   git merge --no-edit origin/main
>   git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count3
>   printf "blob_count1=%s\n" $(cat blob_count1)
>   printf "blob_count2=%s\n" $(cat blob_count2)
>   printf "blob_count3=%s\n" $(cat blob_count3)
> )
>
> warning: This repository uses promisor remotes. Some objects may not be loaded.
> remote: Enumerating objects: 32, done.
> remote: Counting objects: 100% (32/32), done.
> remote: Compressing objects: 100% (20/20), done.
> remote: Total 30 (delta 0), reused 0 (delta 0), pack-reused 0
> Receiving objects: 100% (30/30), 2.61 KiB | 2.61 MiB/s, done.
> From /Users/adl/./mono-repo
>  * branch            main       -> FETCH_HEAD
>    a6a17f2..16a8585  main       -> origin/main
> warning: This repository uses promisor remotes. Some objects may not be loaded.
> Merge made by the 'ort' strategy.

Note: The merge completed successfully, and we see no evidence of
additional blobs being downloaded before this point.

> remote: Enumerating objects: 1, done.
> remote: Counting objects: 100% (1/1), done.
> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
> Receiving objects: 100% (1/1), 87 bytes | 87.00 KiB/s, done.

Here, we do have an object download, which occurred after the merge
completed, so there must be something happening after the merge which
needs the extra blob; if we keep reading...

>  project1/file1 | 10 ++++++++++
>  1 file changed, 10 insertions(+)

Ah, the 'helpful' diffstat.  It downloads blobs from a promisor remote
just so we can see what has changed, including in the area of the
project we don't care about.

(This is yet another reason it'd be nice to have a --restrict mode for
grep/diff/log/etc. for sparse-checkout uses, and an ability to make it
the default in some repo, so you could get just the diffstat within
the region of the project that you care about.  We're discussing such
an idea, but it isn't implemented yet.)

> warning: This repository uses promisor remotes. Some objects may not be loaded.
> blob_count1=11
> blob_count2=11
> blob_count3=12
>
> The result shows that blob count doesn't change in git fetch, but in git merge.

If you add --no-stat to your merge command (or set merge.stat to
false), the extra blob will not be downloaded.
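
As a sketch of that suggestion, assuming a POSIX shell and a remote named
origin; merge_no_stat is a hypothetical helper name, not a git command. It
fetches and then merges without the trailing diffstat, so the merge has no
reason to download blobs just to summarize changes outside the
sparse-checkout:

```shell
# merge_no_stat [branch]  (hypothetical helper, not part of git)
# Fetches the branch from origin and merges it with the diffstat disabled.
merge_no_stat() {
  branch=${1:-main}
  git fetch origin "$branch" &&
    git merge --no-stat --no-edit "origin/$branch"
}
```

To make this the default for one repository instead, run
"git config merge.stat false" inside it.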

* Re: Question relate to collaboration on git monorepo
  2022-09-21 23:36     ` Elijah Newren
@ 2022-09-22 14:24       ` Derrick Stolee
  2022-09-22 15:20         ` Emily Shaffer
  2022-09-23 15:46         ` Junio C Hamano
  2022-09-23 14:31       ` ZheNing Hu
  1 sibling, 2 replies; 12+ messages in thread
From: Derrick Stolee @ 2022-09-22 14:24 UTC (permalink / raw)
  To: Elijah Newren, ZheNing Hu
  Cc: Emily Shaffer, Git List, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

On 9/21/2022 7:36 PM, Elijah Newren wrote:
> On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:

> Here, we do have an object download, which occurred after the merge
> completed, so there must be something happening after the merge which
> needs the extra blob; if we keep reading...
> 
>>  project1/file1 | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
> 
> Ah, the 'helpful' diffstat.  It downloads blobs from a promisor remote
> just so we can see what has changed, including in the area of the
> project we don't care about.
> 
> (This is yet another reason it'd be nice to have a --restrict mode for
> grep/diff/log/etc. for sparse-checkout uses, and an ability to make it
> the default in some repo, so you could get just the diffstat within
> the region of the project that you care about.  We're discussing such
> an idea, but it isn't implemented yet.)
> 
>> warning: This repository uses promisor remotes. Some objects may not be loaded.
>> blob_count1=11
>> blob_count2=11
>> blob_count3=12
>>
>> The result shows that blob count doesn't change in git fetch, but in git merge.
> 
> If you add --no-stat to your merge command (or set merge.stat to
> false), the extra blob will not be downloaded.

This is an interesting find! I wonder how many people are hitting this
in the wild. Perhaps merge.stat should be added to the optional, but
recommended config options in scalar.c.

Thanks,
-Stolee

* Re: Question relate to collaboration on git monorepo
  2022-09-22 14:24       ` Derrick Stolee
@ 2022-09-22 15:20         ` Emily Shaffer
  2022-09-23  2:08           ` Elijah Newren
  2022-09-23 15:46         ` Junio C Hamano
  1 sibling, 1 reply; 12+ messages in thread
From: Emily Shaffer @ 2022-09-22 15:20 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, ZheNing Hu, Git List, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

Yeah, that's a weird one. I wonder whether the partial clone filter
setting should also imply a config shutting off diffstats?
 - Emily


* Re: Question relate to collaboration on git monorepo
  2022-09-22 15:20         ` Emily Shaffer
@ 2022-09-23  2:08           ` Elijah Newren
  0 siblings, 0 replies; 12+ messages in thread
From: Elijah Newren @ 2022-09-23  2:08 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Derrick Stolee, ZheNing Hu, Git List, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

On Thu, Sep 22, 2022 at 8:21 AM Emily Shaffer <emilyshaffer@google.com> wrote:
>
> On Thu, Sep 22, 2022 at 7:24 AM Derrick Stolee <derrickstolee@github.com> wrote:
> >
> > On 9/21/2022 7:36 PM, Elijah Newren wrote:
> > > On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
[...]
> > > If you add --no-stat to your merge command (or set merge.stat to
> > > false), the extra blob will not be downloaded.
> >
> > This is an interesting find! I wonder how many people are hitting this
> > in the wild. Perhaps merge.stat should be added to the optional, but
> > recommended config options in scalar.c.
>
> Yeah, that's a weird one. I wonder whether the partial clone filter
> setting should also imply a config shutting off diffstats?

Yeah, ZheNing hit on an interesting case here, and their very careful
problem description with full testcases helped immensely in finding
the issue.  I didn't spot it at first from the output, but being able
to run his examples made it relatively easy to track down -- and then
I noticed I could have figured out the same information from his
output if I had known to look.  Even if I didn't notice the output the
first time around, having the problem be discoverable from his program
output certainly made it easier to describe and explain the issue.
:-)

* Re: Question relate to collaboration on git monorepo
  2022-09-21 23:36     ` Elijah Newren
  2022-09-22 14:24       ` Derrick Stolee
@ 2022-09-23 14:31       ` ZheNing Hu
  1 sibling, 0 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-23 14:31 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Emily Shaffer, Git List, Derrick Stolee, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

Elijah Newren <newren@gmail.com> wrote on Thu, Sep 22, 2022 at 07:36:
>
> On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Emily Shaffer <emilyshaffer@google.com> wrote on Wed, Sep 21, 2022 at 02:53:
> > >
> > > On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > > >
> > > > Hey, guys,
> > > >
> > > > If two users of git monorepo are working on different sub project
> > > > /project1 and /project2 by partial-clone and sparse-checkout ,
> > > > if user one push first, then user two want to push too, he must
> > > > pull some blob which pushed by user one. I guess their repo size
> > > > will gradually increase by other's project's objects, so is there any way
> > > > to delete unnecessary blobs out of working project (sparse-checkout
> > > > filterspec), or just git pull don't really fetch these unnecessary blobs?
> > >
> > > This is exactly what the combination of partial clone and sparse
> > > checkout is for!
> > >
> > > Dev A is working on project1/, and excludes project2/ from her sparse
> > > filter; she also cloned with `--filter=blob:none`.
> > > Dev B is working on project2/, and excludes project1/ from his sparse
> > > filter, and similarly  is using blob:none partial clone filter.
> > >
> > > Assuming everybody is contributing by direct push, and not using a
> > > code review tool or something else which handles the push for them...
> > > Dev A finishes first, and pushes.
> > > Dev B needs to pull, like you say - but during that pull he doesn't
> > > need to fetch the objects in project1, because they're excluded by the
> > > combination of his partial clone filter and his sparse checkout
> > > pattern. The pull needs to happen because there is a new commit which
> > > Dev B's commit needs to treat as a parent, and so Dev B's client needs
> > > to know the ID of that commit.
> > >
> >
> > I don't agree here; it does fetch the blobs during git pull. So I
> > made a small change to the previous test:
> >
> > (
> >   cd m2
> >   git cat-file --batch-check --batch-all-objects | grep blob | wc -l >blob_count1
> > #  git push
> > #  git -c pull.rebase=false pull --no-edit #no conflict
> >   git fetch origin main
> >   git cat-file --batch-check --batch-all-objects | grep blob | wc -l >blob_count2
> >   git merge --no-edit origin/main
> >   git cat-file --batch-check --batch-all-objects | grep blob | wc -l >blob_count3
> >   printf "blob_count1=%s\n" $(cat blob_count1)
> >   printf "blob_count2=%s\n" $(cat blob_count2)
> >   printf "blob_count3=%s\n" $(cat blob_count3)
> > )
> >
> > warning: This repository uses promisor remotes. Some objects may not be loaded.
> > remote: Enumerating objects: 32, done.
> > remote: Counting objects: 100% (32/32), done.
> > remote: Compressing objects: 100% (20/20), done.
> > remote: Total 30 (delta 0), reused 0 (delta 0), pack-reused 0
> > Receiving objects: 100% (30/30), 2.61 KiB | 2.61 MiB/s, done.
> > From /Users/adl/./mono-repo
> >  * branch            main       -> FETCH_HEAD
> >    a6a17f2..16a8585  main       -> origin/main
> > warning: This repository uses promisor remotes. Some objects may not be loaded.
> > Merge made by the 'ort' strategy.
>
> Note: The merge completed successfully, and we see no evidence of
> additional blobs being downloaded before this point.
>

Agreed. The debugger session below shows that the download is not caused
by the merge itself but by the finish() step of git merge, which fetches
the missing objects in order to show the diffstat.

(lldb) b fetch_objects
Breakpoint 1: where = git`fetch_objects + 29 at promisor-remote.c:18:23, address = 0x0000000100275f4d
(lldb) r
Process 62227 launched: '/Users/adl/repos/git/git' (x86_64)
Merge made by the 'ort' strategy.
Process 62227 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100275f4d git`fetch_objects(repo=0x0000000100406a88, remote_name="origin", oids=0x0000000101204360, oid_nr=1) at promisor-remote.c:18:23
   15  const struct object_id *oids,
   16  int oid_nr)
   17  {
-> 18  struct child_process child = CHILD_PROCESS_INIT;
   19  int i;
   20  FILE *child_in;
   21
Target 0: (git) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100275f4d git`fetch_objects(repo=0x0000000100406a88, remote_name="origin", oids=0x0000000101204360, oid_nr=1) at promisor-remote.c:18:23
    frame #1: 0x0000000100275ea3 git`promisor_remote_get_direct(repo=0x0000000100406a88, oids=0x0000000101204360, oid_nr=1) at promisor-remote.c:249:7
    frame #2: 0x00000001001a2fe3 git`diff_queued_diff_prefetch(repository=0x0000000100406a88) at diff.c:6781:2
    frame #3: 0x00000001001a3075 git`diffcore_std(options=0x00007ff7bfefed20) at diff.c:6805:3
    frame #4: 0x000000010009ca11 git`finish(head_commit=0x000000010151f000, remoteheads=0x0000600000004390, new_head=0x00007ff7bfeff030, msg="Merge made by the 'ort' strategy.") at merge.c:499:3
    frame #5: 0x000000010009d787 git`finish_automerge(head=0x000000010151f000, head_subsumed=0, common=0x0000600000004330, remoteheads=0x0000600000004390, result_tree=0x00007ff7bfeff280, wt_strategy="ort") at merge.c:960:2
    frame #6: 0x000000010009b07b git`cmd_merge(argc=1, argv=0x00007ff7bfeff660, prefix=0x0000000000000000) at merge.c:1743:9
    frame #7: 0x0000000100005573 git`run_builtin(p=0x00000001003e0e60, argc=3, argv=0x00007ff7bfeff660) at git.c:466:11
    frame #8: 0x0000000100004098 git`handle_builtin(argc=3, argv=0x00007ff7bfeff660) at git.c:721:3
    frame #9: 0x0000000100004f76 git`run_argv(argcp=0x00007ff7bfeff4dc, argv=0x00007ff7bfeff4d0) at git.c:788:4
    frame #10: 0x0000000100003e69 git`cmd_main(argc=3, argv=0x00007ff7bfeff660) at git.c:921:19
    frame #11: 0x000000010011e8f6 git`main(argc=4, argv=0x00007ff7bfeff658) at common-main.c:56:11
    frame #12: 0x00000001005b94fe dyld`start + 462

> > remote: Enumerating objects: 1, done.
> > remote: Counting objects: 100% (1/1), done.
> > remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
> > Receiving objects: 100% (1/1), 87 bytes | 87.00 KiB/s, done.
>
> Here, we do have an object download, which occurred after the merge
> completed, so there must be something happening after the merge which
> needs the extra blob; if we keep reading...
>
> >  project1/file1 | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
>
> Ah, the 'helpful' diffstat.  It downloads blobs from a promisor remote
> just so we can see what has changed, including in the area of the
> project we don't care about.
>
> (This is yet another reason it'd be nice to have a --restrict mode for
> grep/diff/log/etc. for sparse-checkout uses, and an ability to make it
> the default in some repo, so you could get just the diffstat within
> the region of the project that you care about.  We're discussing such
> an idea, but it isn't implemented yet.)
>
> > warning: This repository uses promisor remotes. Some objects may not be loaded.
> > blob_count1=11
> > blob_count2=11
> > blob_count3=12
> >
> > The result shows that blob count doesn't change in git fetch, but in git merge.
>
> If you add --no-stat to your merge command (or set merge.stat to
> false), the extra blob will not be downloaded.

After setting merge.stat to false, the problem is solved. Thanks a lot!

ZheNing Hu
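
For readers who want to check this locally, here is a minimal sketch of
the scenario. It is not part of the original messages. It assumes a
reasonably recent git with partial clone and sparse-checkout support;
the directory names, the demo identity, and the count_blobs and
clone_sparse helpers are made up for illustration. It rebuilds a
two-project repository, lets user1 push first, and then counts the
blobs present in user2's clone before and after a merge run with
--no-stat.

```shell
#!/bin/sh
# Sketch: does a merge in a blob:none partial clone download extra blobs?
set -e
# Demo identity so commits work without any user configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# "Server": one commit touching both projects, then a bare copy that
# accepts partial-clone filters and on-demand object requests.
git init -q mono-repo
git -C mono-repo symbolic-ref HEAD refs/heads/main
mkdir mono-repo/project1 mono-repo/project2
echo 0 >mono-repo/project1/file1
echo 0 >mono-repo/project2/file2
git -C mono-repo add .
git -C mono-repo commit -q -m init
git clone -q --bare mono-repo mono-repo.git
git -C mono-repo.git config uploadpack.allowFilter true
git -C mono-repo.git config uploadpack.allowAnySHA1InWant true

# Hypothetical helper: count the blob objects that exist locally in $1.
count_blobs() {
  git -C "$1" cat-file --batch-check --batch-all-objects | grep -c ' blob '
}

# Hypothetical helper: blob-less clone into $1, sparse on directory $2.
clone_sparse() {
  git clone -q --filter=blob:none --no-checkout \
    "file://$work/mono-repo.git" "$1"
  git -C "$1" sparse-checkout set "$2"
  git -C "$1" checkout -q main
}

clone_sparse m1 project1
clone_sparse m2 project2

# user1 changes project1 and pushes first.
echo data1 >>m1/project1/file1
git -C m1 commit -q -am c1
git -C m1 push -q origin main

# user2 changes project2, then has to integrate user1's commit.
echo data2 >>m2/project2/file2
git -C m2 commit -q -am c2
git -C m2 fetch -q origin main
before=$(count_blobs m2)
git -C m2 merge --no-stat --no-edit origin/main
after=$(count_blobs m2)
echo "blobs before merge: $before, after merge --no-stat: $after"
```

Dropping --no-stat from the merge line should reproduce the extra
download reported earlier in the thread. To avoid passing the flag on
every merge, "git config merge.stat false" in the clone is the
persistent form of the same workaround.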

* Re: Question relate to collaboration on git monorepo
  2022-09-22 14:24       ` Derrick Stolee
  2022-09-22 15:20         ` Emily Shaffer
@ 2022-09-23 15:46         ` Junio C Hamano
  2022-09-23 18:11           ` Derrick Stolee
  1 sibling, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2022-09-23 15:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, ZheNing Hu, Emily Shaffer, Git List,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

Derrick Stolee <derrickstolee@github.com> writes:

> On 9/21/2022 7:36 PM, Elijah Newren wrote:
>> On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
>> Here, we do have an object download, which occurred after the merge
>> completed, so there must be something happening after the merge which
>> needs the extra blob; if we keep reading...
>> 
>>>  project1/file1 | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>> 
>> Ah, the 'helpful' diffstat.  It downloads blobs from a promisor remote
>> just so we can see what has changed, including in the area of the
>> project we don't care about.
>> ...
> This is an interesting find! I wonder how many people are hitting this
> in the wild. Perhaps merge.stat should be added to the optional, but
> recommended config options in scalar.c.

Hmph.  It somehow sounds like throwing the baby out with the
bathwater, doesn't it.

You are only interested in a few directories in the project.  You
pull from somebody else (or the central repository), and end up
getting updates to both inside and outside the areas of your
interest.

As a project gets larger and better modularized, does it become more
likely that such an update will happen more often?

I am very tempted to suggest that the diffstat after a merge in such
a project should use the sparse cone(s) as pathspec.  Disabling the
"what happened in this merge" report altogether is a way too blunt
instrument, isn't it?
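
Until something like that exists in git itself, the idea can be
approximated by hand. The following is a rough sketch, not anything
proposed in the thread: scoped_diffstat is a hypothetical helper that
passes the output of "git sparse-checkout list" to "git diff --stat" as
the pathspec. It assumes cone mode and directory names without
whitespace, and the small demo repository and identity are made up for
illustration.

```shell
#!/bin/sh
# Sketch: a diffstat limited to the sparse-checkout directories.
set -e
# Demo identity so commits work without any user configuration.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: diffstat between commits $1 and $2, restricted to
# the directories printed by "git sparse-checkout list".
scoped_diffstat() {
  # Word splitting of the list is intentional here; directory names
  # containing whitespace would need more careful handling.
  git diff --stat "$1" "$2" -- $(git sparse-checkout list)
}

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir project1 project2
echo 0 >project1/file1
echo 0 >project2/file2
git add .
git commit -q -m init

# One commit that touches both projects.
echo 1 >>project1/file1
echo 1 >>project2/file2
git commit -q -am "touch both projects"

# Only project2 is of interest to this user.
git sparse-checkout set project2
stat=$(scoped_diffstat HEAD~1 HEAD)
echo "$stat"
```

After a merge made with --no-stat, the same helper could be called as
"scoped_diffstat ORIG_HEAD HEAD". In a partial clone, limiting the diff
by pathspec should mean that only blobs under the listed directories
are needed, though this sketch does not verify that part.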

* Re: Question relate to collaboration on git monorepo
  2022-09-23 15:46         ` Junio C Hamano
@ 2022-09-23 18:11           ` Derrick Stolee
  0 siblings, 0 replies; 12+ messages in thread
From: Derrick Stolee @ 2022-09-23 18:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, ZheNing Hu, Emily Shaffer, Git List,
	Ævar Arnfjörð Bjarmason, Johannes Schindelin,
	Victoria Dye

On 9/23/2022 11:46 AM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:

>> This is an interesting find! I wonder how many people are hitting this
>> in the wild. Perhaps merge.stat should be added to the optional, but
>> recommended config options in scalar.c.
> 
> Hmph.  It somehow sounds like throwing the baby out with the
> bathwater, doesn't it.
> 
> You are only interested in a few directories in the project.  You
> pull from somebody else (or the central repository), and end up
> getting updates to both inside and outside the areas of your
> interest.
> 
> As a project gets larger and better modularized, does it become more
> likely that such an update will happen more often?
> 
> I am very tempted to suggest that the diffstat after a merge in such
> a project should use the sparse cone(s) as pathspec.  Disabling the
> "what happened in this merge" report altogether is a way too blunt
> instrument, isn't it?

You're right that having a sparse-checkout-scoped diffstat is likely
to be interesting in the long term. It does require updating its
format to describe "some paths outside the sparse-checkout changed,
but we won't give you the full diffstat there".

Using something like the config to disable it in the short term
would certainly hide the pain point, but you're right that we shouldn't
stop there.

Thanks,
-Stolee

Thread overview: 12+ messages
2022-09-20 12:42 Question relate to collaboration on git monorepo ZheNing Hu
2022-09-20 18:53 ` Emily Shaffer
2022-09-21 15:22   ` ZheNing Hu
2022-09-21 23:36     ` Elijah Newren
2022-09-22 14:24       ` Derrick Stolee
2022-09-22 15:20         ` Emily Shaffer
2022-09-23  2:08           ` Elijah Newren
2022-09-23 15:46         ` Junio C Hamano
2022-09-23 18:11           ` Derrick Stolee
2022-09-23 14:31       ` ZheNing Hu
2022-09-21  1:47 ` Elijah Newren
2022-09-21 15:42   ` ZheNing Hu
