* Question: How to execute git-gc correctly on the git server @ 2022-12-07 15:58 ZheNing Hu 2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 14+ messages in thread From: ZheNing Hu @ 2022-12-07 15:58 UTC (permalink / raw) To: Git List Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano, Christian Couder, johncai86, Taylor Blau Hi, I would like to run git gc on my git server periodically, which should help reduce storage space and optimize the read performance of the repository. I know GitHub, GitLab, etc. all have this process... But the concurrency between git gc and other git commands is holding me back a bit. git-gc [1] docs say: On the other hand, when git gc runs concurrently with another process, there is a risk of it deleting an object that the other process is using but hasn’t created a reference to. This may just cause the other process to fail or may corrupt the repository if the other process later adds a reference to the deleted object. It seems that git gc is a dangerous operation that may cause data corruption concurrently with other git commands. Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used to keep those expiring unreachable objects in a cruft pack, but the blog says GitHub uses a special "limbo" repository to keep the cruft pack for git data recovery. Well, a lot of the details here are pretty hard to understand for me :( However, on the other hand, my git server is still at v2.35, and --cruft was introduced in v2.38, so I'm actually more curious about: how did the server execute git gc correctly in the past? Do we need a repository level "big lock" that blocks most/all other git operations? What should the behavior of users' git clone/push be at this time? Report an error that the git server is performing git gc? Or just wait for git gc to complete? Thanks for any comments and help! [1]: https://git-scm.com/docs/git-gc [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/ ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question: How to execute git-gc correctly on the git server 2022-12-07 15:58 Question: How to execute git-gc correctly on the git server ZheNing Hu @ 2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason 2022-12-08 1:16 ` Michal Suchánek 2022-12-08 6:59 ` Jeff King 0 siblings, 2 replies; 14+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-07 23:57 UTC (permalink / raw) To: ZheNing Hu Cc: Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Wed, Dec 07 2022, ZheNing Hu wrote: > I would like to run git gc on my git server periodically, which should help > reduce storage space and optimize the read performance of the repository. > I know GitHub, GitLab, etc. all have this process... > > But the concurrency between git gc and other git commands is holding > me back a bit. > > git-gc [1] docs say: > > On the other hand, when git gc runs concurrently with another process, > there is a risk of it deleting an object that the other process is using but > hasn’t created a reference to. This may just cause the other process to > fail or may corrupt the repository if the other process later adds > a reference > to the deleted object. > > It seems that git gc is a dangerous operation that may cause data corruption > concurrently with other git commands. > > Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used > to keep those expiring unreachable objects in a cruft pack, but the blog says > GitHub uses a special "limbo" repository to keep the cruft pack for git data > recovery. Well, a lot of the details here are pretty hard to understand for me :( > > However, on the other hand, my git server is still at v2.35, and --cruft was > introduced in v2.38, so I'm actually more curious about: how did the server > execute git gc correctly in the past? Do we need a repository level "big lock" > that blocks most/all other git operations? What should the behavior of users' > git clone/push be at this time? Report an error that the git server is performing > git gc? Or just wait for git gc to complete? > > Thanks for any comments and help! > > [1]: https://git-scm.com/docs/git-gc > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/ Is this for a very large hosting site that's anywhere near GitHub, GitLab's etc. scale? A "git gc" on a "live" repo is always racy in theory, but the odds that you'll run into data corrupting trouble tends to approach zero as you increase the gc.pruneExpire setting, with the default 2 weeks being more than enough for even the most paranoid user. The "cruft pack" facility does many different things, and my understanding of it is that GitHub's not using it only as an end-run around potential corruption issues, but that some not yet in tree patches on top of it allow more aggressive "gc" without the fear of corruption. So, I think you probably don't need to worry about it. Other major hosting sites do run "git gc" on live repositories, but as always take backups etc. ^ permalink raw reply [flat|nested] 14+ messages in thread
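For anyone wanting to act on that advice, the knob in question is ordinary git config; a minimal sketch, with an illustrative repository path and an arbitrary three-month value (not a recommendation made anywhere in this thread):

    # Unreachable loose objects younger than this are not pruned by "git gc".
    # The default is 2.weeks.ago; raising it widens the window in which an
    # object that briefly looks unreachable survives on disk.
    git -C /srv/git/repo.git config gc.pruneExpire 3.months.ago

    # A related but separately configured expiry for unreachable reflog entries:
    git -C /srv/git/repo.git config gc.reflogExpireUnreachable 3.months.ago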
* Re: Question: How to execute git-gc correctly on the git server 2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason @ 2022-12-08 1:16 ` Michal Suchánek 2022-12-08 7:01 ` Jeff King 2022-12-09 7:15 ` ZheNing Hu 1 sibling, 2 replies; 14+ messages in thread From: Michal Suchánek @ 2022-12-08 1:16 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Dec 07 2022, ZheNing Hu wrote: > > > I would like to run git gc on my git server periodically, which should help > > reduce storage space and optimize the read performance of the repository. > > I know GitHub, GitLab, etc. all have this process... > > > > But the concurrency between git gc and other git commands is holding > > me back a bit. > > > > git-gc [1] docs say: > > > > On the other hand, when git gc runs concurrently with another process, > > there is a risk of it deleting an object that the other process is using but > > hasn’t created a reference to. This may just cause the other process to > > fail or may corrupt the repository if the other process later adds > > a reference > > to the deleted object. > > > > It seems that git gc is a dangerous operation that may cause data corruption > > concurrently with other git commands. > > > > Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used > > to keep those expiring unreachable objects in a cruft pack, but the blog says > > GitHub uses a special "limbo" repository to keep the cruft pack for git data > > recovery. Well, a lot of the details here are pretty hard to understand for me :( > > > > However, on the other hand, my git server is still at v2.35, and --cruft was > > introduced in v2.38, so I'm actually more curious about: how did the server > > execute git gc correctly in the past? Do we need a repository level "big lock" > > that blocks most/all other git operations? What should the behavior of users' > > git clone/push be at this time? Report an error that the git server is performing > > git gc? Or just wait for git gc to complete? > > > > Thanks for any comments and help! > > > > [1]: https://git-scm.com/docs/git-gc > > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/ > > Is this for a very large hosting site that's anywhere near GitHub, > GitLab's etc. scale? > > A "git gc" on a "live" repo is always racy in theory, but the odds that > you'll run into data corrupting trouble tends to approach zero as you > increase the gc.pruneExpire setting, with the default 2 weeks being more > than enough for even the most paranoid user. And that two-week expiration applies to what, exactly? For commits there is author date and commit date but many other objects won't have these I suppose. And the date when the object is pushed into the repository is unrelated to these two, anyway. > So, I think you probably don't need to worry about it. Other major > hosting sites do run "git gc" on live repositories, but as always take > backups etc. Actually, it is a real problem. With <100 users and some scripting I got unexplained repository corruptions which went away when gc was disabled. YMMV Bad locking design is always a landmine waiting to get triggered. If you step carefully you might avoid it for some time. Thanks Michal ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question: How to execute git-gc correctly on the git server 2022-12-08 1:16 ` Michal Suchánek @ 2022-12-08 7:01 ` Jeff King 2022-12-09 0:49 ` Michal Suchánek 0 siblings, 1 reply; 14+ messages in thread From: Jeff King @ 2022-12-08 7:01 UTC (permalink / raw) To: Michal Suchánek Cc: Ævar Arnfjörð Bjarmason, ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Thu, Dec 08, 2022 at 02:16:31AM +0100, Michal Suchánek wrote: > > A "git gc" on a "live" repo is always racy in theory, but the odds that > > you'll run into data corrupting trouble tends to approach zero as you > > increase the gc.pruneExpire setting, with the default 2 weeks being more > > than enough for even the most paranoid user. > > And that two-week expiration applies to what, exactly? > > For commits there is author date and commit date but many other objects > won't have these I suppose. And the date when the object is pushed into > the repository is unrelated to these two, anyway. In this case it's the mtime on the object file (or the pack containing it). But yes, it is far from a complete race-free solution. -Peff ^ permalink raw reply [flat|nested] 14+ messages in thread
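To make the mtime-based cutoff concrete, a small sketch against a bare repository (paths are illustrative; the dry run only reports what would be pruned, deleting nothing):

    # List unreachable loose objects that an expiry pass would delete; the
    # "age" being tested is the file mtime, not any author/commit date.
    git -C /srv/git/repo.git prune --expire=2.weeks.ago --dry-run

    # The mtimes themselves can be inspected directly (GNU find):
    find /srv/git/repo.git/objects -type f -printf '%TY-%Tm-%Td %p\n' | sort | head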
* Re: Question: How to execute git-gc correctly on the git server 2022-12-08 7:01 ` Jeff King @ 2022-12-09 0:49 ` Michal Suchánek 2022-12-09 1:37 ` Jeff King 0 siblings, 1 reply; 14+ messages in thread From: Michal Suchánek @ 2022-12-09 0:49 UTC (permalink / raw) To: Jeff King Cc: Ævar Arnfjörð Bjarmason, ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Thu, Dec 08, 2022 at 02:01:05AM -0500, Jeff King wrote: > On Thu, Dec 08, 2022 at 02:16:31AM +0100, Michal Suchánek wrote: > > > > A "git gc" on a "live" repo is always racy in theory, but the odds that > > > you'll run into data corrupting trouble tends to approach zero as you > > > increase the gc.pruneExpire setting, with the default 2 weeks being more > > > than enough for even the most paranoid user. > > > > And that two-week expiration applies to what, exactly? > > > > For commits there is author date and commit date but many other objects > > won't have these I suppose. And the date when the object is pushed into > > the repository is unrelated to these two, anyway. > In this case it's the mtime on the object file (or the pack containing > it). But yes, it is far from a complete race-free solution. So if you are pushing a branch that happens to reuse commits or other objects from an earlier branch that might have been collected in the meantime you are basically doomed. How likely that is depends a lot on your workflow. People deleting a branch and then pushing another variant in which many objects are the same is a risk. People exporting files from somewhere and adding them to the repo which are bit-identical when independently exported by multiple people and sometimes deleting branches is a risk. ... Thanks Michal ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question: How to execute git-gc correctly on the git server 2022-12-09 0:49 ` Michal Suchánek @ 2022-12-09 1:37 ` Jeff King 2022-12-09 7:26 ` ZheNing Hu 0 siblings, 1 reply; 14+ messages in thread From: Jeff King @ 2022-12-09 1:37 UTC (permalink / raw) To: Michal Suchánek Cc: Ævar Arnfjörð Bjarmason, ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote: > > In this case it's the mtime on the object file (or the pack containing > > it). But yes, it is far from a complete race-free solution. > > So if you are pushing a branch that happens to reuse commits or other > objects from an earlier branch that might have been collected in the > meantime you are basically doomed. Basically yes. We do "freshen" the mtimes on object files when we omit an object write (e.g., your index state ends up at the same tree as an old one). But for a push, there is no freshening. We check the graph at the time of the push and decide if we have everything we need (either newly pushed, or from what we already had in the repo). And that is what's racy; somebody might be deleting as that check is happening. > People deleting a branch and then pushing another variant in which many > objects are the same is a risk. > > People exporting files from somewhere and adding them to the repo which > are bit-identical when independently exported by multiple people and > sometimes deleting branches is a risk. Yes, both of those are risky (along with many other variants). -Peff ^ permalink raw reply [flat|nested] 14+ messages in thread
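A contrived illustration of the sequence Michal and Peff are describing, with hypothetical branch and repository names; hitting the window on demand is a matter of timing, but the shape is exactly this:

    # Client A: delete a topic branch; on the server its objects become
    # unreachable once nothing else references them.
    git push origin --delete topic-a

    # Server, about gc.pruneExpire later: a scheduled gc prunes the
    # now-expired unreachable objects.
    git -C /srv/git/repo.git gc

    # Client B, racing that gc: push a branch built on A's old work. The
    # server's connectivity check may conclude it already has those objects
    # just as gc deletes them, leaving the new ref pointing at a hole.
    git push origin topic-b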
* Re: Question: How to execute git-gc correctly on the git server 2022-12-09 1:37 ` Jeff King @ 2022-12-09 7:26 ` ZheNing Hu 2022-12-09 13:48 ` Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 14+ messages in thread From: ZheNing Hu @ 2022-12-09 7:26 UTC (permalink / raw) To: Jeff King Cc: Michal Suchánek, Ævar Arnfjörð Bjarmason, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37: > > On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote: > > > > In this case it's the mtime on the object file (or the pack containing > > > it). But yes, it is far from a complete race-free solution. > > > > So if you are pushing a branch that happens to reuse commits or other > > objects from an earlier branch that might have been collected in the > > meantime you are basically doomed. > > Basically yes. We do "freshen" the mtimes on object files when we omit > an object write (e.g., your index state ends up at the same tree as an > old one). But for a push, there is no freshening. We check the graph at > the time of the push and decide if we have everything we need (either > newly pushed, or from what we already had in the repo). And that is > what's racy; somebody might be deleting as that check is happening. > > > People deleting a branch and then pushing another variant in which many > > objects are the same is a risk. > > > > People exporting files from somewhere and adding them to the repo which > > are bit-identical when independently exported by multiple people and > > sometimes deleting branches is a risk. > > Yes, both of those are risky (along with many other variants). > I'm wondering if there's an easy, even if poorly performing, way to do gc safely? For example, adding a file lock to the repository during git push and git gc? > -Peff ^ permalink raw reply [flat|nested] 14+ messages in thread
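For what it's worth, the naive "big lock" being asked about can be sketched with flock(1); the paths are assumptions, and note that a shared lock held only while a hook runs narrows the race rather than closing it (objects arrive before pre-receive, and the ref update happens after it exits):

    # Server cron job: hold the lock exclusively for the whole gc run.
    flock --exclusive /srv/git/repo.git/gc.lock git -C /srv/git/repo.git gc

A matching hooks/pre-receive could then make pushes wait out any running gc:

    #!/bin/sh
    exec 9>>"$GIT_DIR/gc.lock"
    flock --shared 9 || exit 1   # use --nonblock instead to report an error
    cat >/dev/null               # the pushed ref updates are not inspected here
    exit 0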
* Re: Question: How to execute git-gc correctly on the git server 2022-12-09 7:26 ` ZheNing Hu @ 2022-12-09 13:48 ` Ævar Arnfjörð Bjarmason 2022-12-11 16:01 ` ZheNing Hu 0 siblings, 1 reply; 14+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-09 13:48 UTC (permalink / raw) To: ZheNing Hu Cc: Jeff King, Michal Suchánek, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Fri, Dec 09 2022, ZheNing Hu wrote: > Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37: >> >> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote: >> >> > > In this case it's the mtime on the object file (or the pack containing >> > > it). But yes, it is far from a complete race-free solution. >> > >> > So if you are pushing a branch that happens to reuse commits or other >> > objects from an earlier branch that might have been collected in the >> > meantime you are basically doomed. >> >> Basically yes. We do "freshen" the mtimes on object files when we omit >> an object write (e.g., your index state ends up at the same tree as an >> old one). But for a push, there is no freshening. We check the graph at >> the time of the push and decide if we have everything we need (either >> newly pushed, or from what we already had in the repo). And that is >> what's racy; somebody might be deleting as that check is happening. >> >> > People deleting a branch and then pushing another variant in which many >> > objects are the same is a risk. >> > >> > People exporting files from somewhere and adding them to the repo which >> > are bit-identical when independently exported by multiple people and >> > sometimes deleting branches is a risk. >> >> Yes, both of those are risky (along with many other variants). >> > > I'm wondering if there's an easy, even if poorly performing, way to do > gc safely? For example, adding a file lock to the repository during > git push and git gc? We don't have any "easy" way to do it, but we probably should. The root cause of the race is tricky to fix, and we don't have any "global ref lock". But in the context of a client<->server setup where you want to do gc on the server, a good enough and easy solution would be e.g.:

1. Have a {pre,post}-receive hook logging attempted/finished pushes
2. Have the pre-receive hook able to reject (or better yet, hang with sleep()) incoming deletions
3. Do a gc with a small wrapper script, which:
   - Flips the "no deletion ops now" (or "delay deletion ops") switch
   - Polls until it's sure there's no relevant in-progress operations
   - Does a full gc
   - Unlocks

You'd need to be certain that all relevant repo operations are going through git-receive-pack etc., i.e. a local "git branch -d" or the like won't run {pre,post}-receive. ^ permalink raw reply [flat|nested] 14+ messages in thread
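A rough shape of that wrapper, strictly as a sketch: the flag file, the in-flight counter maintained by the hooks, and every path below are inventions of this illustration, not existing git facilities:

    #!/bin/sh
    # gc-wrapper.sh: step 3 of the list above.
    repo=/srv/git/repo.git

    # Flip the "no deletion ops now" switch; the pre-receive hook below
    # rejects deletions while this file exists.
    touch "$repo/gc-in-progress"

    # Poll until no pushes are in flight; the {pre,post}-receive hooks are
    # assumed to increment/decrement this counter file.
    while [ "$(cat "$repo/pushes-in-flight" 2>/dev/null || echo 0)" -gt 0 ]
    do
        sleep 1
    done

    git -C "$repo" gc               # full gc while deletions are held off
    rm -f "$repo/gc-in-progress"    # unlock

And the companion hooks/pre-receive sketch, rejecting deletions while the switch is flipped (a deletion arrives as an all-zero new OID; the constant below is for SHA-1 repositories):

    #!/bin/sh
    zero=0000000000000000000000000000000000000000
    while read old new ref
    do
        if [ "$new" = "$zero" ] && [ -e "$GIT_DIR/gc-in-progress" ]
        then
            echo "ref deletions are briefly disabled during gc, please retry" >&2
            exit 1
        fi
    done
    exit 0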
* Re: Question: How to execute git-gc correctly on the git server 2022-12-09 13:48 ` Ævar Arnfjörð Bjarmason @ 2022-12-11 16:01 ` ZheNing Hu 2022-12-11 16:27 ` Michal Suchánek 0 siblings, 1 reply; 14+ messages in thread From: ZheNing Hu @ 2022-12-11 16:01 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, Michal Suchánek, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on Fri, Dec 9, 2022 at 21:52: > > > On Fri, Dec 09 2022, ZheNing Hu wrote: > > > Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37: > >> > >> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote: > >> > >> > > In this case it's the mtime on the object file (or the pack containing > >> > > it). But yes, it is far from a complete race-free solution. > >> > > >> > So if you are pushing a branch that happens to reuse commits or other > >> > objects from an earlier branch that might have been collected in the > >> > meantime you are basically doomed. > >> > >> Basically yes. We do "freshen" the mtimes on object files when we omit > >> an object write (e.g., your index state ends up at the same tree as an > >> old one). But for a push, there is no freshening. We check the graph at > >> the time of the push and decide if we have everything we need (either > >> newly pushed, or from what we already had in the repo). And that is > >> what's racy; somebody might be deleting as that check is happening. > >> > >> > People deleting a branch and then pushing another variant in which many > >> > objects are the same is a risk. > >> > > >> > People exporting files from somewhere and adding them to the repo which > >> > are bit-identical when independently exported by multiple people and > >> > sometimes deleting branches is a risk. > >> > >> Yes, both of those are risky (along with many other variants). > >> > > > > I'm wondering if there's an easy, even if poorly performing, way to do > > gc safely? For example, adding a file lock to the repository during > > git push and git gc? > > We don't have any "easy" way to do it, but we probably should. The root > cause of the race is tricky to fix, and we don't have any "global ref > lock". > > But in the context of a client<->server setup where you want to do gc on the > server, a good enough and easy solution would be e.g.: > > 1. Have a {pre,post}-receive hook logging attempted/finished pushes > 2. Have the pre-receive hook able to reject (or better yet, hang with > sleep()) incoming deletions > 3. Do a gc with a small wrapper script, which: > - Flips the "no deletion ops now" (or "delay deletion ops") switch > - Polls until it's sure there's no relevant in-progress operations > - Does a full gc > - Unlocks > Well, I understand that after a branch is deleted, some objects may become unreachable, and these objects may then be deleted by a concurrent git gc, which leads to data corruption in a concurrent git push if these objects need to be referenced. But what I don't understand is: is it enough to just block the operation of deleting branches? Once gc happens to be deleting an unreachable object, and a git push of a new branch (git receive-pack) happens to need it, won't it still go wrong? > You'd need to be certain that all relevant repo operations are going > through git-receive-pack etc., i.e. a local "git branch -d" or the like > won't run {pre,post}-receive. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question: How to execute git-gc correctly on the git server 2022-12-11 16:01 ` ZheNing Hu @ 2022-12-11 16:27 ` Michal Suchánek 0 siblings, 0 replies; 14+ messages in thread From: Michal Suchánek @ 2022-12-11 16:27 UTC (permalink / raw) To: ZheNing Hu Cc: Ævar Arnfjörð Bjarmason, Jeff King, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Mon, Dec 12, 2022 at 12:01:51AM +0800, ZheNing Hu wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on Fri, Dec 9, 2022 at 21:52: > > > > > > > On Fri, Dec 09 2022, ZheNing Hu wrote: > > > > > Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37: > > >> > > >> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote: > > >> > > >> > > In this case it's the mtime on the object file (or the pack containing > > >> > > it). But yes, it is far from a complete race-free solution. > > >> > > > >> > So if you are pushing a branch that happens to reuse commits or other > > >> > objects from an earlier branch that might have been collected in the > > >> > meantime you are basically doomed. > > >> > > >> Basically yes. We do "freshen" the mtimes on object files when we omit > > >> an object write (e.g., your index state ends up at the same tree as an > > >> old one). But for a push, there is no freshening. We check the graph at > > >> the time of the push and decide if we have everything we need (either > > >> newly pushed, or from what we already had in the repo). And that is > > >> what's racy; somebody might be deleting as that check is happening. > > >> > > >> > People deleting a branch and then pushing another variant in which many > > >> > objects are the same is a risk. > > >> > > > >> > People exporting files from somewhere and adding them to the repo which > > >> > are bit-identical when independently exported by multiple people and > > >> > sometimes deleting branches is a risk. > > >> > > >> Yes, both of those are risky (along with many other variants). > > >> > > > > > > I'm wondering if there's an easy, even if poorly performing, way to do > > > gc safely? For example, adding a file lock to the repository during > > > git push and git gc? > > > > We don't have any "easy" way to do it, but we probably should. The root > > cause of the race is tricky to fix, and we don't have any "global ref > > lock". > > > > But in the context of a client<->server setup where you want to do gc on the > > server, a good enough and easy solution would be e.g.: > > > > 1. Have a {pre,post}-receive hook logging attempted/finished pushes > > 2. Have the pre-receive hook able to reject (or better yet, hang with > > sleep()) incoming deletions > > 3. Do a gc with a small wrapper script, which: > > - Flips the "no deletion ops now" (or "delay deletion ops") switch > > - Polls until it's sure there's no relevant in-progress operations > > - Does a full gc > > - Unlocks > > > > Well, I understand that after a branch is deleted, some objects may become > unreachable, and these objects may then be deleted by a concurrent git gc, > which leads to data corruption in a concurrent git push if these objects need > to be referenced. But what I don't understand is: is it enough to just block > the operation of deleting branches? Once gc happens to be deleting an > unreachable object, and a git push of a new branch (git receive-pack) happens to > need it, won't it still go wrong? As I understand the problem:
- push figures out which objects are missing on the remote end
- push starts sending the missing objects
- remote gc deletes objects that are not reachable but push assumes remote has them - these might be part of a branch deleted long before gc started
- push finishes and branch is advanced to point to an object that references objects that were deleted by gc -> repository is corrupted

The only way to prevent this is to not delete anything ever, or to make sure that objects that are part of any ongoing operation are always referenced. Which would probably mean in practice that any operation adding objects needs to add temporary references to any objects it creates or aims to reference, and/or check reachability of referenced objects once the final object is created. Thanks Michal ^ permalink raw reply [flat|nested] 14+ messages in thread
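Michal's "temporary references" idea can be pictured roughly as follows; the refs/in-flight namespace and the surrounding orchestration are invented for illustration (git-receive-pack does nothing like this today, and $TIP is a placeholder for whatever tip the operation relies on):

    # Hypothetical receive wrapper: before the connectivity check, pin the
    # tips the push builds on so gc sees everything beneath them as reachable.
    id="push-$$-$(date +%s)"
    git update-ref "refs/in-flight/$id" "$TIP"

    # ... receive the pack, check connectivity, update the real ref ...

    # Whether the push succeeded or failed, drop the pin afterwards:
    git update-ref -d "refs/in-flight/$id"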
* Re: Question: How to execute git-gc correctly on the git server 2022-12-08 1:16 ` Michal Suchánek 2022-12-08 7:01 ` Jeff King @ 2022-12-09 7:15 ` ZheNing Hu 1 sibling, 0 replies; 14+ messages in thread From: ZheNing Hu @ 2022-12-09 7:15 UTC (permalink / raw) To: Michal Suchánek Cc: Ævar Arnfjörð Bjarmason, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau Michal Suchánek <msuchanek@suse.de> wrote on Thu, Dec 8, 2022 at 09:16: > > On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote: > > > > On Wed, Dec 07 2022, ZheNing Hu wrote: > > > > > I would like to run git gc on my git server periodically, which should help > > > reduce storage space and optimize the read performance of the repository. > > > I know GitHub, GitLab, etc. all have this process... > > > > > > But the concurrency between git gc and other git commands is holding > > > me back a bit. > > > > > > git-gc [1] docs say: > > > > > > On the other hand, when git gc runs concurrently with another process, > > > there is a risk of it deleting an object that the other process is using but > > > hasn’t created a reference to. This may just cause the other process to > > > fail or may corrupt the repository if the other process later adds > > > a reference > > > to the deleted object. > > > > > > It seems that git gc is a dangerous operation that may cause data corruption > > > concurrently with other git commands. > > > > > > Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used > > > to keep those expiring unreachable objects in a cruft pack, but the blog says > > > GitHub uses a special "limbo" repository to keep the cruft pack for git data > > > recovery. Well, a lot of the details here are pretty hard to understand for me :( > > > > > > However, on the other hand, my git server is still at v2.35, and --cruft was > > > introduced in v2.38, so I'm actually more curious about: how did the server > > > execute git gc correctly in the past? Do we need a repository level "big lock" > > > that blocks most/all other git operations? What should the behavior of users' > > > git clone/push be at this time? Report an error that the git server is performing > > > git gc? Or just wait for git gc to complete? > > > > > > Thanks for any comments and help! > > > > > > [1]: https://git-scm.com/docs/git-gc > > > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/ > > > > Is this for a very large hosting site that's anywhere near GitHub, > > GitLab's etc. scale? > > > > A "git gc" on a "live" repo is always racy in theory, but the odds that > > you'll run into data corrupting trouble tends to approach zero as you > > increase the gc.pruneExpire setting, with the default 2 weeks being more > > than enough for even the most paranoid user. > > And that two-week expiration applies to what, exactly? > > For commits there is author date and commit date but many other objects > won't have these I suppose. And the date when the object is pushed into > the repository is unrelated to these two, anyway. > > > So, I think you probably don't need to worry about it. Other major > > hosting sites do run "git gc" on live repositories, but as always take > > backups etc. > > Actually, it is a real problem. With <100 users and some scripting I got > unexplained repository corruptions which went away when gc was disabled. > > YMMV > > Bad locking design is always a landmine waiting to get triggered. If you > step carefully you might avoid it for some time. > I agree with this. What I hope to achieve is "no error at all" rather than "a small probability of error". > Thanks > > Michal ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Question: How to execute git-gc correctly on the git server 2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason 2022-12-08 1:16 ` Michal Suchánek @ 2022-12-08 6:59 ` Jeff King 2022-12-08 12:35 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 14+ messages in thread From: Jeff King @ 2022-12-08 6:59 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote: > Is this for a very large hosting site that's anywhere near GitHub, > GitLab's etc. scale? > > A "git gc" on a "live" repo is always racy in theory, but the odds that > you'll run into data corrupting trouble tends to approach zero as you > increase the gc.pruneExpire setting, with the default 2 weeks being more > than enough for even the most paranoid user. I'm a bit less optimistic than "tends to approach zero" here. The objects themselves might be older, but they may become referenced or unreferenced in an immediate racy way. E.g., intending to move branch Z to branch A, a client asks to add A and remove Z. Since there is no atomic view of the ref namespace, a simultaneous gc might see Z gone, but A not yet existing (and of course it could also be _two_ clients, etc). > The "cruft pack" facility does many different things, and my > understanding of it is that GitHub's not using it only as an end-run > around potential corruption issues, but that some not yet in tree > patches on top of it allow more aggressive "gc" without the fear of > corruption. I don't think cruft packs themselves help against corruption that much. For many years, GitHub used "repack -k" to just never expire objects. What cruft packs help with is:

1. They keep cruft objects out of the main pack, which reduces the costs of lookups and bitmaps for the main pack.

2. When you _do_ choose to expire, you can do so without worrying about accidentally exploding all of those old objects into loose ones (which is not wrong from a correctness point of view, but can have some amazingly bad performance characteristics).

I think the bits you're thinking of on top are in v2.39. The "repack --expire-to" option lets you write objects that _would_ be deleted into a cruft pack, which can serve as a backup (but managing that is out of scope for repack itself, so you have to roll your own strategy there). -Peff ^ permalink raw reply [flat|nested] 14+ messages in thread
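Those pieces combine roughly as follows on v2.39 or newer; a sketch with an assumed backup location (and, as Peff says, rotating or expiring that backup is your own problem):

    # Pack unreachable objects into a cruft pack; objects whose mtime is past
    # the expiration are not deleted outright but written to a pack under the
    # backup repository instead.
    git -C /srv/git/repo.git repack --cruft \
        --cruft-expiration=2.weeks.ago -d \
        --expire-to=/srv/git/backup.git/objects/pack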
* Re: Question: How to execute git-gc correctly on the git server 2022-12-08 6:59 ` Jeff King @ 2022-12-08 12:35 ` Ævar Arnfjörð Bjarmason 2022-12-14 20:11 ` Taylor Blau 0 siblings, 1 reply; 14+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-12-08 12:35 UTC (permalink / raw) To: Jeff King Cc: ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86, Taylor Blau On Thu, Dec 08 2022, Jeff King wrote: > On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote: > >> Is this for a very large hosting site that's anywhere near GitHub, >> GitLab's etc. scale? >> >> A "git gc" on a "live" repo is always racy in theory, but the odds that >> you'll run into data corrupting trouble tends to approach zero as you >> increase the gc.pruneExpire setting, with the default 2 weeks being more >> than enough for even the most paranoid user. > > I'm a bit less optimistic than "tends to approach zero" here. The > objects themselves might be older, but they may become referenced or > unreferenced in an immediate racy way. E.g., intending to move branch Z > to branch A, a client asks to add A and remove Z. Since there is no > atomic view of the ref namespace, a simultaneous gc might see Z gone, > but A not yet existing (and of course it could also be _two_ clients, > etc). Yes, I'm really hand-waving away the issue there for brevity. You've got a really good summary of why exactly that race happens somewhere in the list archive, which I'm not digging up now. But (and just for the general audience here, I know you know this) that race basically happens because "gc" concurrently decides that, say, a 2-week-old blob containing "foo" isn't referenced, *and* we have a concurrent branch push that happens to reference a "foo" blob, along with a concurrent "gc". I have run into this once or twice in practice (with a very high volume in-house git server), and in those cases it was because person "B" doing the new push was using person "A"'s work to push a new branch, ~gc.pruneExpire after the topic branch for "A" had been deleted. In principle what I noted upthread is completely false, but I think in practice it's almost always true. I.e. users aren't pushing random blobs, and as time goes by the odds that the exact same content is re-pushed go down. It's also worth noting that some repositories are much more vulnerable to this than others. If you have predictable auto-generated content in your repo (think the package+version list some languages routinely carry in-tree) you're much more likely to run into this: Someone bumped that for a topic ~2 weeks ago, nobody else bothered on any of their branches, and then the "A"+"B" race described above happens. As some practical advice to those running "gc" on live repos: To easily mitigate this, run expiry during the least busy hours of the day. Even for truly global development teams there's usually a big lull when it's high noon in the middle of the Pacific. >> The "cruft pack" facility does many different things, and my >> understanding of it is that GitHub's not using it only as an end-run >> around potential corruption issues, but that some not yet in tree >> patches on top of it allow more aggressive "gc" without the fear of >> corruption. > > I don't think cruft packs themselves help against corruption that much. > For many years, GitHub used "repack -k" to just never expire objects. > What cruft packs help with is: > > 1. They keep cruft objects out of the main pack, which reduces the > costs of lookups and bitmaps for the main pack. > > 2. When you _do_ choose to expire, you can do so without worrying > about accidentally exploding all of those old objects into loose > ones (which is not wrong from a correctness point of view, but can > have some amazingly bad performance characteristics). > > I think the bits you're thinking of on top are in v2.39. The "repack > --expire-to" option lets you write objects that _would_ be deleted into > a cruft pack, which can serve as a backup (but managing that is out of > scope for repack itself, so you have to roll your own strategy there). Yes, that's what I was referring to. I think I had feedback on that series saying that if held correctly this would also nicely solve that long-time race. Maybe I'm just misremembering, but I (mis?)recalled that Taylor indicated that it was being used like that at GitHub. Another thing that really helps to mitigate it is this never-in-tree patch of mine (which ran in busy production for years, and probably still): https://lore.kernel.org/git/20181028225023.26427-1-avarab@gmail.com/ It's sensitive to "transfer.unpackLimit" if it's going to help with that, and even if you always write duplicate content it won't fully help with the race, as "B" may have seen the old ref, and hence not sent the "foo" blob over (so the client would need to not have a copy of "A"'s about-to-be-deleted topic). All of which describes a setup I ran it in, so *if* we ever ran into the race, most likely we'd just have written duplicate content in the incoming PACK, and we'd happily chug along. ^ permalink raw reply [flat|nested] 14+ messages in thread
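The knob mentioned there is stock configuration, even though the freshening patch itself never landed; a sketch with an illustrative threshold:

    # Pushes with at most this many objects are exploded into loose objects
    # on the server rather than kept as a pack; receive.unpackLimit falls
    # back to transfer.unpackLimit (default 100) when unset.
    git -C /srv/git/repo.git config receive.unpackLimit 5000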
* Re: Question: How to execute git-gc correctly on the git server 2022-12-08 12:35 ` Ævar Arnfjörð Bjarmason @ 2022-12-14 20:11 ` Taylor Blau 0 siblings, 0 replies; 14+ messages in thread From: Taylor Blau @ 2022-12-14 20:11 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jeff King, ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86 On Thu, Dec 08, 2022 at 01:35:04PM +0100, Ævar Arnfjörð Bjarmason wrote: > >> The "cruft pack" facility does many different things, and my > >> understanding of it is that GitHub's not using it only as an end-run > >> around potential corruption issues, but that some not yet in tree > >> patches on top of it allow more aggressive "gc" without the fear of > >> corruption. > > > > I don't think cruft packs themselves help against corruption that much. > > For many years, GitHub used "repack -k" to just never expire objects. > > What cruft packs help with is: > > > > 1. They keep cruft objects out of the main pack, which reduces the > > costs of lookups and bitmaps for the main pack. Peff isn't wrong here, but there is a big caveat, which is that this is only true when using a single pack bitmap. Single pack bitmaps are guaranteed to have reachability closure over their objects, but writing a MIDX bitmap after generating the MIDX does not afford us the same guarantees. So if you have a cruft pack which contains some unreachable object X, which is made reachable by some other object that *is* reachable from some reference, *and that* object is included in one of the MIDX's packs, then we won't have reachability closure unless we also bitmap the cruft pack, too. So even though it helps a lot with bitmapping in the single-pack case, in practice it doesn't make a significant difference with multi-pack bitmaps. > > 2. When you _do_ choose to expire, you can do so without worrying > > about accidentally exploding all of those old objects into loose > > ones (which is not wrong from a correctness point of view, but can > > have some amazingly bad performance characteristics). > > > > I think the bits you're thinking of on top are in v2.39. The "repack > > --expire-to" option lets you write objects that _would_ be deleted into > > a cruft pack, which can serve as a backup (but managing that is out of > > scope for repack itself, so you have to roll your own strategy there). > > Yes, that's what I was referring to. Yes, we use the `--expire-to` option when doing a pruning GC to move the expired objects out of the repo to some "../backup.git" location. The out-of-tree tooling that Ævar is speculating about basically runs `cat-file --batch` in the backup repo, feeding it the list of missing objects, and then writing those objects (back) into the GC'd repository. > I think I had feedback on that series saying that if held correctly this > would also nicely solve that long-time race. Maybe I'm just > misremembering, but I (mis?)recalled that Taylor indicated that it was > being used like that at GitHub. It (the above) doesn't solve the race, but it does make it easier to recover from a corrupt repository when we lose that race. Thanks, Taylor ^ permalink raw reply [flat|nested] 14+ messages in thread
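A sketch of the recovery loop Taylor outlines, using only stock git commands; the paths, the backup layout, and the crude fsck scraping are assumptions here (GitHub's actual tooling is out of tree and presumably different):

    # Collect the ids of objects fsck reports missing or broken.
    git -C /srv/git/repo.git fsck --full 2>&1 |
        grep -oE '[0-9a-f]{40}' | sort -u >missing.txt

    # Stream exactly those objects out of the backup repository and unpack
    # them back into the live one.
    git -C /srv/git/backup.git pack-objects --stdout <missing.txt |
        git -C /srv/git/repo.git unpack-objects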
end of thread, other threads: [~2022-12-14 20:25 UTC | newest]

Thread overview: 14+ messages
2022-12-07 15:58 Question: How to execute git-gc correctly on the git server ZheNing Hu
2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
2022-12-08 1:16 ` Michal Suchánek
2022-12-08 7:01 ` Jeff King
2022-12-09 0:49 ` Michal Suchánek
2022-12-09 1:37 ` Jeff King
2022-12-09 7:26 ` ZheNing Hu
2022-12-09 13:48 ` Ævar Arnfjörð Bjarmason
2022-12-11 16:01 ` ZheNing Hu
2022-12-11 16:27 ` Michal Suchánek
2022-12-09 7:15 ` ZheNing Hu
2022-12-08 6:59 ` Jeff King
2022-12-08 12:35 ` Ævar Arnfjörð Bjarmason
2022-12-14 20:11 ` Taylor Blau