* Question relate to collaboration on git monorepo
@ 2022-09-20 12:42 ZheNing Hu
  2022-09-20 18:53 ` Emily Shaffer
  2022-09-21  1:47 ` Elijah Newren
  0 siblings, 2 replies; 12+ messages in thread
From: ZheNing Hu @ 2022-09-20 12:42 UTC
To: Git List
Cc: Derrick Stolee, Junio C Hamano, Ævar Arnfjörð Bjarmason,
    Johannes Schindelin, Victoria Dye, Elijah Newren

Hey, guys,

If two users of a git monorepo are working on different subprojects,
/project1 and /project2, using partial clone and sparse-checkout, and
user one pushes first, then user two, who wants to push too, must
first pull some blobs that were pushed by user one. I guess their repo
size will gradually increase with other projects' objects, so is there
any way to delete unnecessary blobs outside the working project (the
sparse-checkout filterspec), or to have git pull not fetch these
unnecessary blobs at all?

The large number of interruptions in git push may be another problem:
if thousands of projects are in one monorepo, and no one else has any
code that would conflict with mine in any way, do I still need to pull
every time? Is there a way to make improvements here?

Here's an example of how two users constrain each other when running
git push.

#!/bin/bash
# bash is required for the C-style for loops below

# upstream repository with two projects
rm -rf mono-repo
git init mono-repo -b main
(
  cd mono-repo
  mkdir project1
  mkdir project2
  for ((i=0;i<10;i++))
  do
    echo $i >> project1/file1
    echo $i >> project2/file2
  done
  git add .
  git commit -m init
)

rm -rf mono-repo.git
git clone --bare mono-repo

# user1
rm -rf m1
git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1
(
  cd m1
  git sparse-checkout set project1
  git checkout main
  for ((i=0;i<10;i++))
  do
    echo "data1-$i" >> project1/file1
    git add .
    git commit -m "c1 $i"
  done
)

# user2
rm -rf m2
git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m2
(
  cd m2
  git sparse-checkout set project2
  git checkout main
  for ((i=0;i<10;i++))
  do
    echo "data2-$i" >> project2/file2
    git add .
    git commit -m "c2 $i"
  done
)

# user1 push
(
  cd m1
  git push
)

# user2 push failed, then pull user1's blob
(
  cd m2
  # count the blobs present locally before and after the pull
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
  git push
  git -c pull.rebase=false pull --no-edit   # no conflict
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
  diff blob_count1 blob_count2
)

--
ZheNing Hu

* Re: Question relate to collaboration on git monorepo
From: Emily Shaffer @ 2022-09-20 18:53 UTC
To: ZheNing Hu
Cc: Git List, Derrick Stolee, Junio C Hamano,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye,
    Elijah Newren

On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Hey, guys,
>
> If two users of a git monorepo are working on different subprojects,
> /project1 and /project2, using partial clone and sparse-checkout, and
> user one pushes first, then user two, who wants to push too, must
> first pull some blobs that were pushed by user one. I guess their repo
> size will gradually increase with other projects' objects, so is there
> any way to delete unnecessary blobs outside the working project (the
> sparse-checkout filterspec), or to have git pull not fetch these
> unnecessary blobs at all?

This is exactly what the combination of partial clone and sparse
checkout is for!

Dev A is working on project1/, and excludes project2/ from her sparse
filter; she also cloned with `--filter=blob:none`.
Dev B is working on project2/, and excludes project1/ from his sparse
filter, and similarly is using the blob:none partial clone filter.

Assuming everybody is contributing by direct push, and not using a
code review tool or something else which handles the push for them...
Dev A finishes first, and pushes.
Dev B needs to pull, like you say - but during that pull he doesn't
need to fetch the objects in project1, because they're excluded by the
combination of his partial clone filter and his sparse checkout
pattern. The pull needs to happen because there is a new commit which
Dev B's commit needs to treat as a parent, and so Dev B's client needs
to know the ID of that commit.

> The large number of interruptions in git push may be another problem:
> if thousands of projects are in one monorepo, and no one else has any
> code that would conflict with mine in any way, do I still need to pull
> every time? Is there a way to make improvements here?

The typical improvement people make here is to use some form of
automation or tooling to perform the push and merge for them. That
usually falls to the code review tool. We can describe the history
like this: "S" is the source commit which both A and B branched from,
and "A" and "B" are the commits by their respective owners. Because of
the order of push, we want the final commit history to look like
"S -> A -> B". Dev A's local history looks like "S -> A" and Dev B's
local history looks like "S -> B".

If we're using the GitHub PR model, then GitHub may do merge commits
for us, and it creates those merge commits automatically at the time
someone pushes "Merge PR" (or whatever the button is called). So our
history probably looks like:

  o     (merge B)
  | \
  o  |  (merge A)
  |\ |
  | | B
  | A |
  | / /
  S

In this case, neither A nor B needs to know about the other, because
the merge commit is being created by the code review tool.

With tooling like Gerrit, or other tooling that uses the rebase
strategy (rather than merge), pretty much the same thing happens -
both devs can push without knowing about each other's commits because
the review tool's automation performs the rebase (that is, the "git
pull" you described) for them.

But if you're not using tooling, yeah, Dev B needs to know which
commit should come before his own commit, so he needs to fetch latest
history, even though the only changes are from Dev A who was working
somewhere else in the monorepo.

> Here's an example of how two users constrain each other when running
> git push.
>
> [...]

* Re: Question relate to collaboration on git monorepo
From: ZheNing Hu @ 2022-09-21 15:22 UTC
To: Emily Shaffer
Cc: Git List, Derrick Stolee, Junio C Hamano,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye,
    Elijah Newren

On Wed, Sep 21, 2022 at 02:53, Emily Shaffer <emilyshaffer@google.com> wrote:
>
> [...]
>
> Assuming everybody is contributing by direct push, and not using a
> code review tool or something else which handles the push for them...
> Dev A finishes first, and pushes.
> Dev B needs to pull, like you say - but during that pull he doesn't
> need to fetch the objects in project1, because they're excluded by the
> combination of his partial clone filter and his sparse checkout
> pattern. The pull needs to happen because there is a new commit which
> Dev B's commit needs to treat as a parent, and so Dev B's client needs
> to know the ID of that commit.
>

I don't agree here; it does fetch blobs during git pull. So I made a
small change to the previous test, splitting the pull into a fetch and
a merge and counting blobs after each step:

(
  cd m2
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
  # git push
  # git -c pull.rebase=false pull --no-edit #no conflict
  git fetch origin main
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
  git merge --no-edit origin/main
  git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count3
  printf "blob_count1=%s\n" $(cat blob_count1)
  printf "blob_count2=%s\n" $(cat blob_count2)
  printf "blob_count3=%s\n" $(cat blob_count3)
)

warning: This repository uses promisor remotes. Some objects may not be loaded.
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 30 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (30/30), 2.61 KiB | 2.61 MiB/s, done.
From /Users/adl/./mono-repo
 * branch            main       -> FETCH_HEAD
   a6a17f2..16a8585  main       -> origin/main
warning: This repository uses promisor remotes. Some objects may not be loaded.
Merge made by the 'ort' strategy.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (1/1), 87 bytes | 87.00 KiB/s, done.
 project1/file1 | 10 ++++++++++
 1 file changed, 10 insertions(+)
warning: This repository uses promisor remotes. Some objects may not be loaded.
blob_count1=11
blob_count2=11
blob_count3=12

The result shows that the blob count doesn't change during git fetch,
but it does during git merge.

However, not all of the history of this blob is pulled down here, so
the growth of the local repository should be slow. So my concern is
whether there is a way to periodically clean up these unneeded blobs.

> The typical improvement people make here is to use some form of
> automation or tooling to perform the push and merge for them. That
> usually falls to the code review tool.
>
> [...]
>
> With tooling like Gerrit, or other tooling that uses the rebase
> strategy (rather than merge), pretty much the same thing happens -
> both devs can push without knowing about each other's commits because
> the review tool's automation performs the rebase (that is, the "git
> pull" you described) for them.
>

Agree. With GitHub PRs or Gerrit, the merge/rebase happens on a remote
server, so the local repo does not run git merge and therefore doesn't
need to fetch blobs.

> But if you're not using tooling, yeah, Dev B needs to know which
> commit should come before his own commit, so he needs to fetch latest
> history, even though the only changes are from Dev A who was working
> somewhere else in the monorepo.
>

Thanks for the answer,
ZheNing Hu
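
The blob counts in the test above come from listing every object that is physically present in the local object store. A minimal sketch of that measurement as a reusable helper follows. The function name count_local_blobs is hypothetical (it is not a git command), and the awk filter on the type column is a stand-in for the grep used in the test script.

```shell
#!/bin/sh
# count_local_blobs [DIR]: hypothetical helper, not part of git.
# Prints the number of blob objects physically present in the object
# store of the repository at DIR (default: current directory). In a
# partial clone, blobs that were never downloaded are not counted.
count_local_blobs() {
    # Each output line of --batch-check is "<oid> <type> <size>".
    git -C "${1:-.}" cat-file --batch-check --batch-all-objects |
        awk '$2 == "blob" { n++ } END { print n + 0 }'
}
```

Calling it before and after git fetch and again after git merge, as the test does, shows at which step an extra blob arrives.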

* Re: Question relate to collaboration on git monorepo
From: Elijah Newren @ 2022-09-21 23:36 UTC
To: ZheNing Hu
Cc: Emily Shaffer, Git List, Derrick Stolee, Junio C Hamano,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> [...]
>
> I don't agree here; it does fetch blobs during git pull. So I made a
> small change to the previous test:
>
> (
>   cd m2
>   git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count1
>   # git push
>   # git -c pull.rebase=false pull --no-edit #no conflict
>   git fetch origin main
>   git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count2
>   git merge --no-edit origin/main
>   git cat-file --batch-check --batch-all-objects | grep blob | wc -l > blob_count3
>   printf "blob_count1=%s\n" $(cat blob_count1)
>   printf "blob_count2=%s\n" $(cat blob_count2)
>   printf "blob_count3=%s\n" $(cat blob_count3)
> )
>
> warning: This repository uses promisor remotes. Some objects may not be loaded.
> remote: Enumerating objects: 32, done.
> remote: Counting objects: 100% (32/32), done.
> remote: Compressing objects: 100% (20/20), done.
> remote: Total 30 (delta 0), reused 0 (delta 0), pack-reused 0
> Receiving objects: 100% (30/30), 2.61 KiB | 2.61 MiB/s, done.
> From /Users/adl/./mono-repo
>  * branch            main       -> FETCH_HEAD
>    a6a17f2..16a8585  main       -> origin/main
> warning: This repository uses promisor remotes. Some objects may not be loaded.
> Merge made by the 'ort' strategy.

Note: The merge completed successfully, and we see no evidence of
additional blobs being downloaded before this point.

> remote: Enumerating objects: 1, done.
> remote: Counting objects: 100% (1/1), done.
> remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
> Receiving objects: 100% (1/1), 87 bytes | 87.00 KiB/s, done.

Here, we do have an object download, which occurred after the merge
completed, so there must be something happening after the merge which
needs the extra blob; if we keep reading...

>  project1/file1 | 10 ++++++++++
>  1 file changed, 10 insertions(+)

Ah, the 'helpful' diffstat. It downloads blobs from a promisor remote
just so we can see what has changed, including in the area of the
project we don't care about.

(This is yet another reason it'd be nice to have a --restrict mode for
grep/diff/log/etc. for sparse-checkout uses, and an ability to make it
the default in some repo, so you could get just the diffstat within
the region of the project that you care about. We're discussing such
an idea, but it isn't implemented yet.)

> warning: This repository uses promisor remotes. Some objects may not be loaded.
> blob_count1=11
> blob_count2=11
> blob_count3=12
>
> The result shows that the blob count doesn't change during git fetch,
> but it does during git merge.

If you add --no-stat to your merge command (or set merge.stat to
false), the extra blob will not be downloaded.
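
The two remedies named above can be sketched as follows. This is only an illustration in a throwaway repository created with mktemp, not part of the original test script; merge.stat and --no-stat are the names given in the message, while the scratch path and the variable names are made up here.

```shell
#!/bin/sh
set -e

# Scratch repository purely for illustration (assumption: any clone
# of the monorepo would be configured the same way).
repo=$(mktemp -d)
git init -q "$repo"

# Persistent, per-repository setting: do not print a diffstat at the
# end of a merge.
git -C "$repo" config merge.stat false
merge_stat=$(git -C "$repo" config --get merge.stat)
echo "merge.stat=$merge_stat"

# One-off alternative inside a real clone (left commented out because
# the scratch repository has no remote to merge from):
#   git merge --no-stat --no-edit origin/main
```

The config setting is the more convenient of the two for git pull, since it applies to every merge in that repository without changing the command line.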

* Re: Question relate to collaboration on git monorepo
From: Derrick Stolee @ 2022-09-22 14:24 UTC
To: Elijah Newren, ZheNing Hu
Cc: Emily Shaffer, Git List, Junio C Hamano,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

On 9/21/2022 7:36 PM, Elijah Newren wrote:
> On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:

> Here, we do have an object download, which occurred after the merge
> completed, so there must be something happening after the merge which
> needs the extra blob; if we keep reading...
>
>>  project1/file1 | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>
> Ah, the 'helpful' diffstat. It downloads blobs from a promisor remote
> just so we can see what has changed, including in the area of the
> project we don't care about.
>
> (This is yet another reason it'd be nice to have a --restrict mode for
> grep/diff/log/etc. for sparse-checkout uses, and an ability to make it
> the default in some repo, so you could get just the diffstat within
> the region of the project that you care about. We're discussing such
> an idea, but it isn't implemented yet.)
>
>> warning: This repository uses promisor remotes. Some objects may not be loaded.
>> blob_count1=11
>> blob_count2=11
>> blob_count3=12
>>
>> The result shows that blob count doesn't change in git fetch, but in git merge.
>
> If you add --no-stat to your merge command (or set merge.stat to
> false), the extra blob will not be downloaded.

This is an interesting find! I wonder how many people are hitting this
in the wild. Perhaps merge.stat should be added to the optional, but
recommended config options in scalar.c.

Thanks,
-Stolee

* Re: Question relate to collaboration on git monorepo
From: Emily Shaffer @ 2022-09-22 15:20 UTC
To: Derrick Stolee
Cc: Elijah Newren, ZheNing Hu, Git List, Junio C Hamano,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

Yeah, that's a weird one. I wonder whether the partial clone filter
setting should also imply a config shutting off diffstats?

 - Emily

On Thu, Sep 22, 2022 at 7:24 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/21/2022 7:36 PM, Elijah Newren wrote:
> > On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> [...]
>
> > If you add --no-stat to your merge command (or set merge.stat to
> > false), the extra blob will not be downloaded.
>
> This is an interesting find! I wonder how many people are hitting this
> in the wild. Perhaps merge.stat should be added to the optional, but
> recommended config options in scalar.c.
>
> Thanks,
> -Stolee

* Re: Question relate to collaboration on git monorepo
From: Elijah Newren @ 2022-09-23 2:08 UTC
To: Emily Shaffer
Cc: Derrick Stolee, ZheNing Hu, Git List, Junio C Hamano,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

On Thu, Sep 22, 2022 at 8:21 AM Emily Shaffer <emilyshaffer@google.com> wrote:
>
> On Thu, Sep 22, 2022 at 7:24 AM Derrick Stolee <derrickstolee@github.com> wrote:
> >
> > On 9/21/2022 7:36 PM, Elijah Newren wrote:
> > > On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
[...]
> > > If you add --no-stat to your merge command (or set merge.stat to
> > > false), the extra blob will not be downloaded.
> >
> > This is an interesting find! I wonder how many people are hitting this
> > in the wild. Perhaps merge.stat should be added to the optional, but
> > recommended config options in scalar.c.
>
> Yeah, that's a weird one. I wonder whether the partial clone filter
> setting should also imply a config shutting off diffstats?

Yeah, ZheNing hit on an interesting case here, and their very careful
problem description with full testcases helped immensely in finding
the issue. I didn't spot it at first from the output, but being able
to run his examples made it relatively easy to track down -- and then
I noticed I could have figured out the same information from his
output if I had known to look. Even if I didn't notice the output the
first time around, having the problem be discoverable from his program
output certainly made it easier to describe and explain the issue. :-)

* Re: Question relate to collaboration on git monorepo
From: Junio C Hamano @ 2022-09-23 15:46 UTC
To: Derrick Stolee
Cc: Elijah Newren, ZheNing Hu, Emily Shaffer, Git List,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

Derrick Stolee <derrickstolee@github.com> writes:

> On 9/21/2022 7:36 PM, Elijah Newren wrote:
>> On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
>> Here, we do have an object download, which occurred after the merge
>> completed, so there must be something happening after the merge which
>> needs the extra blob; if we keep reading...
>>
>>>  project1/file1 | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>
>> Ah, the 'helpful' diffstat. It downloads blobs from a promisor remote
>> just so we can see what has changed, including in the area of the
>> project we don't care about.
>> ...
> This is an interesting find! I wonder how many people are hitting this
> in the wild. Perhaps merge.stat should be added to the optional, but
> recommended config options in scalar.c.

Hmph. It somehow sounds like throwing the baby out with the
bathwater, doesn't it.

You are only interested in a few directories in the project. You
pull from somebody else (or the central repository), and end up
getting updates to both inside and outside the areas of your
interest.

As a project gets larger and better modularized, does it become more
likely that such an update will happen more often?

I am very tempted to suggest that the diffstat after a merge in such
a project should use the sparse cone(s) as pathspec. Disabling the
"what happened in this merge" report altogether is a way too blunt
instrument, isn't it?

* Re: Question relate to collaboration on git monorepo
From: Derrick Stolee @ 2022-09-23 18:11 UTC
To: Junio C Hamano
Cc: Elijah Newren, ZheNing Hu, Emily Shaffer, Git List,
    Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

On 9/23/2022 11:46 AM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:

>> This is an interesting find! I wonder how many people are hitting this
>> in the wild. Perhaps merge.stat should be added to the optional, but
>> recommended config options in scalar.c.
>
> Hmph. It somehow sounds like throwing the baby out with the
> bathwater, doesn't it.
>
> You are only interested in a few directories in the project. You
> pull from somebody else (or the central repository), and end up
> getting updates to both inside and outside the areas of your
> interest.
>
> As a project gets larger and better modularized, does it become more
> likely that such an update will happen more often?
>
> I am very tempted to suggest that the diffstat after a merge in such
> a project should use the sparse cone(s) as pathspec. Disabling the
> "what happened in this merge" report altogether is a way too blunt
> instrument, isn't it?

You're right that having a sparse-checkout-scoped diffstat is likely
to be interesting in the long term. It does require updating its format
to describe "some paths outside the sparse-checkout changed, but we
won't give you the full diffstat there".

Using something like the config to disable it in the short term would
certainly hide the pain point, but you're right that we shouldn't stop
there.

Thanks,
-Stolee
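
Until a sparse-aware diffstat exists in git itself, something close to the idea discussed above can be approximated by hand, by passing the sparse-checkout directories as a pathspec to git diff. A rough sketch, assuming a cone-mode sparse-checkout and directory names without whitespace; sparse_diffstat is a hypothetical helper name, and whether this avoids every blob download in a partial clone has not been verified here.

```shell
#!/bin/sh
# sparse_diffstat OLD NEW: hypothetical helper, not part of git.
# Prints a diffstat between two commits, limited to the directories
# reported by `git sparse-checkout list`.
sparse_diffstat() {
    # The command substitution is intentionally unquoted so that each
    # listed directory becomes its own pathspec argument (this is why
    # directory names with whitespace are not supported).
    git diff --stat "$1" "$2" -- $(git sparse-checkout list)
}
```

After a merge run with --no-stat, `sparse_diffstat ORIG_HEAD HEAD` would then report only the changes inside the directories the user has checked out.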
* Re: Question relate to collaboration on git monorepo 2022-09-21 23:36 ` Elijah Newren 2022-09-22 14:24 ` Derrick Stolee @ 2022-09-23 14:31 ` ZheNing Hu 1 sibling, 0 replies; 12+ messages in thread From: ZheNing Hu @ 2022-09-23 14:31 UTC (permalink / raw) To: Elijah Newren Cc: Emily Shaffer, Git List, Derrick Stolee, Junio C Hamano, Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye Elijah Newren <newren@gmail.com> 于2022年9月22日周四 07:36写道: > > On Wed, Sep 21, 2022 at 8:22 AM ZheNing Hu <adlternative@gmail.com> wrote: > > > > Emily Shaffer <emilyshaffer@google.com> 于2022年9月21日周三 02:53写道: > > > > > > On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote: > > > > > > > > Hey, guys, > > > > > > > > If two users of git monorepo are working on different sub project > > > > /project1 and /project2 by partial-clone and sparse-checkout , > > > > if user one push first, then user two want to push too, he must > > > > pull some blob which pushed by user one. I guess their repo size > > > > will gradually increase by other's project's objects, so is there any way > > > > to delete unnecessary blobs out of working project (sparse-checkout > > > > filterspec), or just git pull don't really fetch these unnecessary blobs? > > > > > > This is exactly what the combination of partial clone and sparse > > > checkout is for! > > > > > > Dev A is working on project1/, and excludes project2/ from her sparse > > > filter; she also cloned with `--filter=blob:none`. > > > Dev B is working on project2/, and excludes project1/ from his sparse > > > filter, and similarly is using blob:none partial clone filter. > > > > > > Assuming everybody is contributing by direct push, and not using a > > > code review tool or something else which handles the push for them... > > > Dev A finishes first, and pushes. 
> > > Dev B needs to pull, like you say - but during that pull he doesn't > > > need to fetch the objects in project1, because they're excluded by the > > > combination of his partial clone filter and his sparse checkout > > > pattern. The pull needs to happen because there is a new commit which > > > Dev B's commit needs to treat as a parent, and so Dev B's client needs > > > to know the ID of that commit. > > > > > > > I don't agree here, it indeed fetches the blobs during git pull. So I > > do a little > > change in the previous test: > > > > ( > > cd m2 > > git cat-file --batch-check --batch-all-objects | grep blob | wc -l > > > blob_count1 > > # git push > > # git -c pull.rebase=false pull --no-edit #no conflict > > git fetch origin main > > git cat-file --batch-check --batch-all-objects | grep blob | wc -l > > > blob_count2 > > git merge --no-edit origin/main > > git cat-file --batch-check --batch-all-objects | grep blob | wc -l > > > blob_count3 > > printf "blob_count1=%s\n" $(cat blob_count1) > > printf "blob_count2=%s\n" $(cat blob_count2) > > printf "blob_count3=%s\n" $(cat blob_count3) > > ) > > > > warning: This repository uses promisor remotes. Some objects may not be loaded. > > remote: Enumerating objects: 32, done. > > remote: Counting objects: 100% (32/32), done. > > remote: Compressing objects: 100% (20/20), done. > > remote: Total 30 (delta 0), reused 0 (delta 0), pack-reused 0 > > Receiving objects: 100% (30/30), 2.61 KiB | 2.61 MiB/s, done. > > From /Users/adl/./mono-repo > > * branch main -> FETCH_HEAD > > a6a17f2..16a8585 main -> origin/main > > warning: This repository uses promisor remotes. Some objects may not be loaded. > > Merge made by the 'ort' strategy. > > Note: The merge completed successfully, and we see no evidence of > additional blobs being downloaded before this point. > Agree. Debug message This is not a problem caused by git merge, but caused by "finish" period of git merge, which fetch missing objects to show the diffstat. 
(lldb) b fetch_objects
Breakpoint 1: where = git`fetch_objects + 29 at promisor-remote.c:18:23, address = 0x0000000100275f4d
(lldb) r
Process 62227 launched: '/Users/adl/repos/git/git' (x86_64)
Merge made by the 'ort' strategy.
Process 62227 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100275f4d git`fetch_objects(repo=0x0000000100406a88, remote_name="origin", oids=0x0000000101204360, oid_nr=1) at promisor-remote.c:18:23
   15           const struct object_id *oids,
   16           int oid_nr)
   17   {
-> 18       struct child_process child = CHILD_PROCESS_INIT;
   19       int i;
   20       FILE *child_in;
   21
Target 0: (git) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100275f4d git`fetch_objects(repo=0x0000000100406a88, remote_name="origin", oids=0x0000000101204360, oid_nr=1) at promisor-remote.c:18:23
    frame #1: 0x0000000100275ea3 git`promisor_remote_get_direct(repo=0x0000000100406a88, oids=0x0000000101204360, oid_nr=1) at promisor-remote.c:249:7
    frame #2: 0x00000001001a2fe3 git`diff_queued_diff_prefetch(repository=0x0000000100406a88) at diff.c:6781:2
    frame #3: 0x00000001001a3075 git`diffcore_std(options=0x00007ff7bfefed20) at diff.c:6805:3
    frame #4: 0x000000010009ca11 git`finish(head_commit=0x000000010151f000, remoteheads=0x0000600000004390, new_head=0x00007ff7bfeff030, msg="Merge made by the 'ort' strategy.") at merge.c:499:3
    frame #5: 0x000000010009d787 git`finish_automerge(head=0x000000010151f000, head_subsumed=0, common=0x0000600000004330, remoteheads=0x0000600000004390, result_tree=0x00007ff7bfeff280, wt_strategy="ort") at merge.c:960:2
    frame #6: 0x000000010009b07b git`cmd_merge(argc=1, argv=0x00007ff7bfeff660, prefix=0x0000000000000000) at merge.c:1743:9
    frame #7: 0x0000000100005573 git`run_builtin(p=0x00000001003e0e60, argc=3, argv=0x00007ff7bfeff660) at git.c:466:11
    frame #8: 0x0000000100004098 git`handle_builtin(argc=3, argv=0x00007ff7bfeff660) at git.c:721:3
    frame #9: 0x0000000100004f76 git`run_argv(argcp=0x00007ff7bfeff4dc, argv=0x00007ff7bfeff4d0) at git.c:788:4
    frame #10: 0x0000000100003e69 git`cmd_main(argc=3, argv=0x00007ff7bfeff660) at git.c:921:19
    frame #11: 0x000000010011e8f6 git`main(argc=4, argv=0x00007ff7bfeff658) at common-main.c:56:11
    frame #12: 0x00000001005b94fe dyld`start + 462

> > remote: Enumerating objects: 1, done.
> > remote: Counting objects: 100% (1/1), done.
> > remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
> > Receiving objects: 100% (1/1), 87 bytes | 87.00 KiB/s, done.
>
> Here, we do have an object download, which occurred after the merge
> completed, so there must be something happening after the merge which
> needs the extra blob; if we keep reading...
>
> >  project1/file1 | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
>
> Ah, the 'helpful' diffstat. It downloads blobs from a promisor remote
> just so we can see what has changed, including in the area of the
> project we don't care about.
>
> (This is yet another reason it'd be nice to have a --restrict mode for
> grep/diff/log/etc. for sparse-checkout uses, and an ability to make it
> the default in some repo, so you could get just the diffstat within
> the region of the project that you care about. We're discussing such
> an idea, but it isn't implemented yet.)
>
> > warning: This repository uses promisor remotes. Some objects may not be loaded.
> > blob_count1=11
> > blob_count2=11
> > blob_count3=12
> >
> > The result shows that blob count doesn't change in git fetch, but in git merge.
>
> If you add --no-stat to your merge command (or set merge.stat to
> false), the extra blob will not be downloaded.

After setting merge.stat to false, the problem is solved.

Thanks a lot!
ZheNing Hu
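Editor's sketch, not part of the thread: one way to check the --no-stat behaviour described above. It assumes that m2 is the partial clone from the test script and that origin/main has already been fetched. The helper names count_local_blobs and merge_without_diffstat are hypothetical; only the git commands themselves come from the thread.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from the thread.

# Print how many blobs are stored locally in the repository at $1.
# In a partial clone this counts only blobs that were actually downloaded.
count_local_blobs() {
    git -C "$1" cat-file --batch-check --batch-all-objects |
        grep -c ' blob ' || true
}

# Merge branch $2 into the current branch of repository $1 without
# printing a diffstat, so no blobs are fetched just to render it.
merge_without_diffstat() {
    git -C "$1" merge --no-stat --no-edit "$2"
}

# Example usage (assumes ./m2 exists and origin/main was already fetched):
#   before=$(count_local_blobs m2)
#   merge_without_diffstat m2 origin/main
#   after=$(count_local_blobs m2)
#   echo "blobs before=$before after=$after"
```

If the thread's conclusion holds for your git version, the two counts should match. Running `git -C m2 config merge.stat false` once should give the same result for plain `git merge` and `git pull` in that repository.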
* Re: Question relate to collaboration on git monorepo
From: Elijah Newren @ 2022-09-21  1:47 UTC
To: ZheNing Hu
Cc: Git List, Derrick Stolee, Junio C Hamano, Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Hey, guys,
>
> If two users of git monorepo are working on different sub project
> /project1 and /project2 by partial-clone and sparse-checkout ,
> if user one push first, then user two want to push too, he must
> pull some blob which pushed by user one.

This is not true. While user two must pull the new commit and any new
trees pushed by user one (which will mean knowing the hashes of the
new files), there is no need to download the actual content of the new
files unless and until some git command is run that attempts to view
the file's contents.

> The large number of interruptions in git push may be another
> problem, if thousands of probjects are in one monorepo, and
> no one else has any code that would conflict with me in any way,
> but I need pull everytime? Is there a way to make improvements
> here?

No, you only need to pull when attempting to push back to the server.

Further, if you're worried that the second push will fail, you could
easily script it and put "pull --rebase && push" in a loop until it
succeeds (I mean, you did say no one would have any conflicts). In
fact, you could just make that a common script distributed to your
users and tell them to run that instead of "git push" if they don't
want to worry about manually updating.

Now, if you have thousands of nearly fully independent subprojects and
lots of developers for each subproject and they all commit & push
*very* frequently, I guess you might be able to eventually get to the
scale where you are worried there will be so much contention that the
script will take too long. I'd be surprised if you got that far, but
even if you did, you could easily adopt a lieutenant-like workflow
(somewhat like the linux kernel, but even simpler given the
independence of your projects). In such a workflow, you'd let people
in subprojects push to their subproject fork (instead of to the "main"
or "central" repository), and the lieutenants of the subprojects then
periodically push work from that subproject to the main project in
batches.

I don't really see much need here for improvements, myself.

> Here's an example of how two users constrain each other when git push.

Did you pay attention to warnings you got along the way? In particular...

> git clone --bare mono-repo

You missed the following command right after your clone:

    git -C mono-repo.git config uploadpack.allowFilter true

> # user1
> rm -rf m1
> git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1

Since you forgot to set the important config I mentioned above, your
command here generates the following line of output, among others:

    warning: filtering not recognized by server, ignoring

This warning means you weren't testing partial clones, but regular
full clones. Perhaps that was the cause of your confusion?
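Editor's sketch, not part of the thread: the "pull --rebase && push" loop suggested above could look roughly like this. The function name push_with_retry, the attempt limit and the delay are assumptions made for illustration. One design choice differs from a bare loop: the sketch stops as soon as the rebase itself fails, because retrying would not resolve a conflict.

```shell
#!/bin/sh
# Hypothetical helper: retry "pull --rebase && push" until the push lands.
#   $1 = maximum number of attempts (default 5)
#   $2 = seconds to wait between attempts (default 1)
push_with_retry() {
    max_attempts=${1:-5}
    delay=${2:-1}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # A failed rebase needs a human, so do not loop on it.
        if ! git pull --rebase; then
            echo "push_with_retry: pull --rebase failed, resolve manually" >&2
            return 1
        fi
        # The push only fails here if someone else pushed in the meantime.
        if git push; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "push_with_retry: giving up after $max_attempts attempts" >&2
    return 1
}

# Example usage, run inside a clone instead of a plain "git push":
#   push_with_retry 10 2
```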

* Re: Question relate to collaboration on git monorepo
From: ZheNing Hu @ 2022-09-21 15:42 UTC
To: Elijah Newren
Cc: Git List, Derrick Stolee, Junio C Hamano, Ævar Arnfjörð Bjarmason, Johannes Schindelin, Victoria Dye

Elijah Newren <newren@gmail.com> wrote on Wed, Sep 21, 2022 at 09:48:
>
> On Tue, Sep 20, 2022 at 5:42 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Hey, guys,
> >
> > If two users of git monorepo are working on different sub project
> > /project1 and /project2 by partial-clone and sparse-checkout ,
> > if user one push first, then user two want to push too, he must
> > pull some blob which pushed by user one.
>
> This is not true. While user two must pull the new commit and any new
> trees pushed by user one (which will mean knowing the hashes of the
> new files), there is no need to download the actual content of the new
> files unless and until some git command is run that attempts to view
> the file's contents.
>

Yeah, now I understand that git fetch will not download blobs outside
the sparse-checkout pattern, but git merge will. So git pull will
download some missing blobs here.

> > The large number of interruptions in git push may be another
> > problem, if thousands of probjects are in one monorepo, and
> > no one else has any code that would conflict with me in any way,
> > but I need pull everytime? Is there a way to make improvements
> > here?
>
> No, you only need to pull when attempting to push back to the server.
>
> Further, if you're worried that the second push will fail, you could
> easily script it and put "pull --rebase && push" in a loop until it
> succeeds (I mean, you did say no one would have any conflicts). In
> fact, you could just make that a common script distributed to your
> users and tell them to run that instead of "git push" if they don't
> want to worry about manually updating.
>

Ah, this method looks a little funny, but it may work. The same issue
may also apply to some code review tools, which might need a
"pull --rebase && git cr" loop.

> Now, if you have thousands of nearly fully independent subprojects and
> lots of developers for each subproject and they all commit & push
> *very* frequently, I guess you might be able to eventually get to the
> scale where you are worried there will be so much contention that the
> script will take too long. I'd be surprised if you got that far, but
> even if you did, you could easily adopt a lieutenant-like workflow
> (somewhat like the linux kernel, but even simpler given the
> independence of your projects). In such a workflow, you'd let people
> in subprojects push to their subproject fork (instead of to the "main"
> or "central" repository), and the lieutenants of the subprojects then
> periodically push work from that subproject to the main project in
> batches.
>

Makes sense. If this monorepo really reaches that kind of scale,
splitting the workflow might be the right thing to do.

> I don't really see much need here for improvements, myself.
>
> > Here's an example of how two users constrain each other when git push.
>
> Did you pay attention to warnings you got along the way? In particular...
>
> > git clone --bare mono-repo
>
> You missed the following command right after your clone:
>
>     git -C mono-repo.git config uploadpack.allowFilter true
>
> > # user1
> > rm -rf m1
> > git clone --filter="blob:none" --no-checkout --no-local ./mono-repo.git m1
>
> Since you forgot to set the important config I mentioned above, your
> command here generates the following line of output, among others:
>
>     warning: filtering not recognized by server, ignoring
>
> This warning means you weren't testing partial clones, but regular
> full clones. Perhaps that was the cause of your confusion?

Oh, sorry, I forgot to mention this: I have already set these globally:

    uploadpack.allowanysha1inwant=true
    uploadpack.allowfilter=true

Thanks for the answer,
ZheNing Hu
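Editor's sketch, not part of the thread: a quick way to confirm that a clone really is partial, rather than a full clone made after the "filtering not recognized by server" warning. The helper names enable_partial_clone and count_missing_objects are hypothetical. The sketch relies on `git rev-list --missing=print`, which marks objects that are reachable but not stored locally with a leading "?".

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from the thread.

# Let clients request filtered (partial) clones from the bare repo at $1.
enable_partial_clone() {
    git -C "$1" config uploadpack.allowFilter true
}

# Print the number of reachable objects that are not stored locally in
# the repository at $1. A full clone prints 0; a blob:none clone made
# with --no-checkout prints the number of blobs it skipped.
count_missing_objects() {
    git -C "$1" rev-list --objects --all --missing=print |
        grep -c '^?' || true
}

# Example usage with the repositories from the test script:
#   enable_partial_clone mono-repo.git
#   git clone --filter=blob:none --no-checkout --no-local ./mono-repo.git m1
#   count_missing_objects m1
```

A result of 0 right after such a clone would suggest the server ignored the filter.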

end of thread, other threads:[~2022-09-23 18:11 UTC | newest]

Thread overview: 12+ messages
2022-09-20 12:42 Question relate to collaboration on git monorepo ZheNing Hu
2022-09-20 18:53 ` Emily Shaffer
2022-09-21 15:22 ` ZheNing Hu
2022-09-21 23:36 ` Elijah Newren
2022-09-22 14:24 ` Derrick Stolee
2022-09-22 15:20 ` Emily Shaffer
2022-09-23  2:08 ` Elijah Newren
2022-09-23 15:46 ` Junio C Hamano
2022-09-23 18:11 ` Derrick Stolee
2022-09-23 14:31 ` ZheNing Hu
2022-09-21  1:47 ` Elijah Newren
2022-09-21 15:42 ` ZheNing Hu