git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Question: How to execute git-gc correctly on the git server
@ 2022-12-07 15:58 ZheNing Hu
  2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 14+ messages in thread
From: ZheNing Hu @ 2022-12-07 15:58 UTC (permalink / raw)
  To: Git List
  Cc: Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Christian Couder, johncai86, Taylor Blau

Hi,

I would like to run git gc on my git server periodically, which should help
reduce storage space and optimize the read performance of the repository.
I know GitHub, GitLab, etc. all have this process...

But the concurrency between git gc and other git commands is holding
me back a bit.

git-gc [1] docs say:

    On the other hand, when git gc runs concurrently with another process,
    there is a risk of it deleting an object that the other process is using but
    hasn’t created a reference to. This may just cause the other process to
    fail or may corrupt the repository if the other process later adds
a reference
    to the deleted object.

It seems that git gc is a dangerous operation that may cause data corruption
when run concurrently with other git commands.

Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used
to keep those expiring unreachable objects in a cruft pack, but the blog says
GitHub uses some special "limbo" repository to keep the cruft pack for git data
recovery. Well, a lot of the details here are pretty hard for me to understand :(

However, on the other hand, my git server is still at v2.35, and --cruft was
introduced in v2.38, so I'm actually more curious about: how did the server
execute git gc correctly in the past? Do we need a repository level "big lock"
that blocks most/all other git operations? What should the behavior of users'
git clone/push be at this time? Report error that the git server is performing
git gc? Or just wait for git gc to complete?
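Something like the following flock(1) wrapper is the kind of repository-level
"big lock" I have in mind; the helper name and paths are my own invention, and
pushes would have to take the same lock (e.g. from a hook) for it to serialize
anything:

```shell
# Hypothetical sketch: run a command while holding a per-repository lock.
# Both "git gc" and the push-handling hooks would need to go through this
# same helper for the lock to actually serialize anything.
run_locked() {
    repo=$1; shift
    (
        # fd 9 holds the lock for the duration of the subshell
        exec 9>"$repo/maintenance.lock"
        flock 9 || exit 1   # block until no other locked operation runs
        "$@"
    )
}
# e.g.: run_locked /srv/git/project.git git -C /srv/git/project.git gc
```

The command's exit status is passed through, so a failing gc can be detected
by the caller.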

Thanks for any comments and help!

[1]: https://git-scm.com/docs/git-gc
[2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-07 15:58 Question: How to execute git-gc correctly on the git server ZheNing Hu
@ 2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
  2022-12-08  1:16   ` Michal Suchánek
  2022-12-08  6:59   ` Jeff King
  0 siblings, 2 replies; 14+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-07 23:57 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Junio C Hamano, Christian Couder, johncai86,
	Taylor Blau


On Wed, Dec 07 2022, ZheNing Hu wrote:

> I would like to run git gc on my git server periodically, which should help
> reduce storage space and optimize the read performance of the repository.
> I know github, gitlab all have this process...
>
> But the concurrency between git gc and other git commands is holding
> me back a bit.
>
> git-gc [1] docs say:
>
>     On the other hand, when git gc runs concurrently with another process,
>     there is a risk of it deleting an object that the other process is using but
>     hasn’t created a reference to. This may just cause the other process to
>     fail or may corrupt the repository if the other process later adds
> a reference
>     to the deleted object.
>
> It seems that git gc is a dangerous operation that may cause data corruption
> concurrently with other git commands.
>
> Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used
> to keep those expiring unreachable objects in a cruft pack, but the blog says
> GitHub uses some special "limbo" repository to keep the cruft pack for git data
> recovery. Well, a lot of the details here are pretty hard for me to understand :(
>
> However, on the other hand, my git server is still at v2.35, and --cruft was
> introduced in v2.38, so I'm actually more curious about: how did the server
> execute git gc correctly in the past? Do we need a repository level "big lock"
> that blocks most/all other git operations? What should the behavior of users'
> git clone/push be at this time? Report error that the git server is performing
> git gc? Or just wait for git gc to complete?
>
> Thanks for any comments and help!
>
> [1]: https://git-scm.com/docs/git-gc
> [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/

Is this for a very large hosting site that's anywhere near GitHub,
GitLab's etc. scale?

A "git gc" on a "live" repo is always racy in theory, but the odds that
you'll run into data-corrupting trouble tend to approach zero as you
increase the gc.pruneExpire setting, with the default of 2 weeks being more
than enough for even the most paranoid user.
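For reference, the knob in question looks like this (the values and cron
schedule below are illustrative, not a recommendation):

```shell
# The grace period before unreachable objects may be pruned; the default
# is 2.weeks.ago, and "never" disables pruning of unreachables entirely.
git config gc.pruneExpire 1.month.ago

# A server might then run gc from cron during off-peak hours, e.g.:
# 0 3 * * *  git -C /srv/git/project.git gc --quiet
```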

The "cruft pack" facility does many different things, and my
understanding of it is that GitHub is not using it only as an end-run
around potential corruption issues, but that some not-yet-in-tree
patches on top of it allow more aggressive "gc" without the fear of
corruption.

So, I think you probably don't need to worry about it. Other major
hosting sites do run "git gc" on live repositories, but as always take
backups etc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
@ 2022-12-08  1:16   ` Michal Suchánek
  2022-12-08  7:01     ` Jeff King
  2022-12-09  7:15     ` ZheNing Hu
  2022-12-08  6:59   ` Jeff King
  1 sibling, 2 replies; 14+ messages in thread
From: Michal Suchánek @ 2022-12-08  1:16 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86,
	Taylor Blau

On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Dec 07 2022, ZheNing Hu wrote:
> 
> > I would like to run git gc on my git server periodically, which should help
> > reduce storage space and optimize the read performance of the repository.
> > I know github, gitlab all have this process...
> >
> > But the concurrency between git gc and other git commands is holding
> > me back a bit.
> >
> > git-gc [1] docs say:
> >
> >     On the other hand, when git gc runs concurrently with another process,
> >     there is a risk of it deleting an object that the other process is using but
> >     hasn’t created a reference to. This may just cause the other process to
> >     fail or may corrupt the repository if the other process later adds
> > a reference
> >     to the deleted object.
> >
> > It seems that git gc is a dangerous operation that may cause data corruption
> > concurrently with other git commands.
> >
> > Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used
> > to keep those expiring unreachable objects in a cruft pack, but the blog says
> > GitHub uses some special "limbo" repository to keep the cruft pack for git data
> > recovery. Well, a lot of the details here are pretty hard for me to understand :(
> >
> > However, on the other hand, my git server is still at v2.35, and --cruft was
> > introduced in v2.38, so I'm actually more curious about: how did the server
> > execute git gc correctly in the past? Do we need a repository level "big lock"
> > that blocks most/all other git operations? What should the behavior of users'
> > git clone/push be at this time? Report error that the git server is performing
> > git gc? Or just wait for git gc to complete?
> >
> > Thanks for any comments and help!
> >
> > [1]: https://git-scm.com/docs/git-gc
> > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/
> 
> Is this for a very large hosting site that's anywhere near GitHub,
> GitLab's etc. scale?
> 
> A "git gc" on a "live" repo is always racy in theory, but the odds that
> you'll run into data corrupting trouble tends to approach zero as you
> increase the gc.pruneExpire setting, with the default 2 weeks being more
> than enough for even the most paranoid user.

And that two weeks expiration applies to what, exactly?

For commits there is author date and commit date, but many other objects
won't have these, I suppose. And the date when the object is pushed into
the repository is unrelated to these two, anyway.

> So, I think you probably don't need to worry about it. Other major
> hosting sites do run "git gc" on live repositories, but as always take
> backups etc.

Actually, it is a real problem. With <100 users and some scripting I got
unexplained repository corruptions which went away when gc was disabled.

YMMV

Bad locking design is always a landmine waiting to get triggered. If you
step carefully you might avoid it for some time.

Thanks

Michal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
  2022-12-08  1:16   ` Michal Suchánek
@ 2022-12-08  6:59   ` Jeff King
  2022-12-08 12:35     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2022-12-08  6:59 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86,
	Taylor Blau

On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:

> Is this for a very large hosting site that's anywhere near GitHub,
> GitLab's etc. scale?
> 
> A "git gc" on a "live" repo is always racy in theory, but the odds that
> you'll run into data corrupting trouble tends to approach zero as you
> increase the gc.pruneExpire setting, with the default 2 weeks being more
> than enough for even the most paranoid user.

I'm a bit less optimistic than "tends to approach zero" here. The
objects themselves might be older, but they may become referenced or
unreferenced in an immediate racy way. E.g., intending to move branch Z
to branch A, a client asks to add A and remove Z. Since there is no
atomic view of the ref namespace, a simultaneous gc might see Z gone,
but A not yet existing (and of course it could also be _two_ clients,
etc).

> The "cruft pack" facility does many different things, and my
> understanding of it is that GitHub's not using it only as an end-run
> around potential corruption issues, but that some not yet in tree
> patches on top of it allow more aggressive "gc" without the fear of
> corruption.

I don't think cruft packs themselves help against corruption that much.
For many years, GitHub used "repack -k" to just never expire objects.
What cruft packs help with is:

  1. They keep cruft objects out of the main pack, which reduces the
     costs of lookups and bitmaps for the main pack.

  2. When you _do_ choose to expire, you can do so without worrying
     about accidentally exploding all of those old objects into loose
     ones (which is not wrong from a correctness point of view, but can
     have some amazingly bad performance characteristics).

I think the bits you're thinking of on top are in v2.39. The "repack
--expire-to" option lets you write objects that _would_ be deleted into
a cruft pack, which can serve as a backup (but managing that is out of
scope for repack itself, so you have to roll your own strategy there).
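In command form that is roughly the following (flag names per git 2.38/2.39;
the backup path is made up, and managing that pack's lifecycle is up to you):

```shell
# Keep unreachable objects in a separate cruft pack instead of exploding
# them into loose objects (git 2.38+):
git repack -a -d --cruft --cruft-expiration=2.weeks.ago

# git 2.39+: objects that *would* be expired are instead written to a
# backup pack at the given path, which repack does not manage for you:
git repack -a -d --cruft --expire-to=/srv/git/backup/project-cruft.pack
```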

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-08  1:16   ` Michal Suchánek
@ 2022-12-08  7:01     ` Jeff King
  2022-12-09  0:49       ` Michal Suchánek
  2022-12-09  7:15     ` ZheNing Hu
  1 sibling, 1 reply; 14+ messages in thread
From: Jeff King @ 2022-12-08  7:01 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Ævar Arnfjörð Bjarmason, ZheNing Hu, Git List,
	Junio C Hamano, Christian Couder, johncai86, Taylor Blau

On Thu, Dec 08, 2022 at 02:16:31AM +0100, Michal Suchánek wrote:

> > A "git gc" on a "live" repo is always racy in theory, but the odds that
> > you'll run into data corrupting trouble tends to approach zero as you
> > increase the gc.pruneExpire setting, with the default 2 weeks being more
> > than enough for even the most paranoid user.
> 
> And that two weeks expiration applies to what, exactly?
> 
> > For commits there is author date and commit date but many other objects
> won't have these I suppose. And the date when the object is pushed into
> the repository is unrelated to these two, anyway.

In this case it's the mtime on the object file (or the pack containing
it). But yes, it is far from a complete race-free solution.
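A rough illustration of what that mtime cutoff means; the helper is
hypothetical and just mirrors the check behind `git prune --expire=2.weeks.ago`:

```shell
# Hypothetical: decide whether a loose object file is old enough to be a
# prune candidate, going purely by the file's mtime (as git does).
is_prune_candidate() {
    # $1 = path to a (supposedly unreachable) loose object file
    if [ -n "$(find "$1" -mtime +14 -print 2>/dev/null)" ]; then
        echo prune
    else
        echo keep
    fi
}
```

Anything that resets the file's mtime (a "freshen") therefore also resets this
clock.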

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-08  6:59   ` Jeff King
@ 2022-12-08 12:35     ` Ævar Arnfjörð Bjarmason
  2022-12-14 20:11       ` Taylor Blau
  0 siblings, 1 reply; 14+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-08 12:35 UTC (permalink / raw)
  To: Jeff King
  Cc: ZheNing Hu, Git List, Junio C Hamano, Christian Couder, johncai86,
	Taylor Blau


On Thu, Dec 08 2022, Jeff King wrote:

> On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Is this for a very large hosting site that's anywhere near GitHub,
>> GitLab's etc. scale?
>> 
>> A "git gc" on a "live" repo is always racy in theory, but the odds that
>> you'll run into data corrupting trouble tends to approach zero as you
>> increase the gc.pruneExpire setting, with the default 2 weeks being more
>> than enough for even the most paranoid user.
>
> I'm a bit less optimistic than "tends to approach zero" here. The
> objects themselves might be older, but they may become referenced or
> unreferenced in an immediate racy way. E.g., intending to move branch Z
> to branch A, a client asks to add A and remove Z. Since there is no
> atomic view of the ref namespace, a simultaneous gc might see Z gone,
> but A not yet existing (and of course it could also be _two_ clients,
> etc).

Yes, I'm really hand-waving away the issue there for brevity.

You've got a really good summary of why exactly that race happens
somewhere in the list archive, which I'm not digging up now.

But (and just for the general audience here, I know you know this) that
race basically happens because "gc" decides that, say, a 2-week-old blob
containing "foo" isn't referenced, *and* we have a concurrent branch
push that happens to reference a "foo" blob.

I have run into this once or twice in practice (with a very high volume
in-house git server), and in those cases it was because person "B" doing
the new push was using person "A"'s work to push a new branch, around
gc.pruneExpire after it was created, while the topic branch for "A" was
simultaneously being deleted.

In principle what I noted upthread is completely false, but I think in
practice it's almost always true. I.e. users aren't pushing random
blobs, and as time goes by the odds that the exact same content is
re-pushed go down.

It's also worth noting that some repositories are much more vulnerable
to this than others.

If you have predictable auto-generated content in your repo (think the
package+version list some languages routinely carry in-tree) you're much
more likely to run into this: Someone bumped that for a topic ~2 weeks
ago, nobody else bothered on any of their branches, and then the "A"+"B"
race described above happens.

As some practical advice to those running "gc" on live repos: to easily
mitigate this, run expiry during the least busy hours of the day. Even for
truly global development teams there's usually a big lull when it's high
noon in the middle of the Pacific.

>> The "cruft pack" facility does many different things, and my
>> understanding of it is that GitHub's not using it only as an end-run
>> around potential corruption issues, but that some not yet in tree
>> patches on top of it allow more aggressive "gc" without the fear of
>> corruption.
>
> I don't think cruft packs themselves help against corruption that much.
> For many years, GitHub used "repack -k" to just never expire objects.
> What cruft packs help with is:
>
>   1. They keep cruft objects out of the main pack, which reduces the
>      costs of lookups and bitmaps for the main pack.
>
>   2. When you _do_ choose to expire, you can do so without worrying
>      about accidentally exploding all of those old objects into loose
>      ones (which is not wrong from a correctness point of view, but can
>      have some amazingly bad performance characteristics).
>
> I think the bits you're thinking of on top are in v2.39. The "repack
> --expire-to" option lets you write objects that _would_ be deleted into
> a cruft pack, which can serve as a backup (but managing that is out of
> scope for repack itself, so you have to roll your own strategy there).

Yes, that's what I was referring to.

I think I had feedback on that series saying that if held correctly this
would also nicely solve that long-standing race. Maybe I'm just
misremembering, but I (mis?)recalled that Taylor indicated that it was
being used like that at GitHub.

Another thing that really helps to mitigate it is this never-in-tree
patch of mine (which ran in busy production for years, and probably
still does):
https://lore.kernel.org/git/20181028225023.26427-1-avarab@gmail.com/

It's sensitive to "transfer.unpackLimit" if it's going to help with
that, and even if you always write duplicate content it won't fully help
with the race, as "B" may have seen the old ref, and hence not sent the
"foo" blob over (so the client would need to not have a copy of "A"'s
about-to-be-deleted topic).

All of which describes a setup I ran it in, so *if* we ever ran into the
race then most likely we'd just have written duplicate content in the
incoming PACK, and we'd happily chug along.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-08  7:01     ` Jeff King
@ 2022-12-09  0:49       ` Michal Suchánek
  2022-12-09  1:37         ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Suchánek @ 2022-12-09  0:49 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, ZheNing Hu, Git List,
	Junio C Hamano, Christian Couder, johncai86, Taylor Blau

On Thu, Dec 08, 2022 at 02:01:05AM -0500, Jeff King wrote:
> On Thu, Dec 08, 2022 at 02:16:31AM +0100, Michal Suchánek wrote:
> 
> > > A "git gc" on a "live" repo is always racy in theory, but the odds that
> > > you'll run into data corrupting trouble tends to approach zero as you
> > > increase the gc.pruneExpire setting, with the default 2 weeks being more
> > > than enough for even the most paranoid user.
> > 
> > And that two weeks expiration applies to what, exactly?
> > 
> > For commits there is author date and commit date but many other objects
> > won't have these I suppose. And the date when the object is pushed into
> > the repository is unrelated to these two, anyway.
> 
> In this case it's the mtime on the object file (or the pack containing
> it). But yes, it is far from a complete race-free solution.

So if you are pushing a branch that happens to reuse commits or other
objects from an earlier branch that might have been collected in the
meantime, you are basically doomed.

How likely that is depends a lot on your workflow.

People deleting a branch and then pushing another variant in which many
objects are the same is a risk.

People exporting files from somewhere and adding them to the repo, where
the files are bit-identical when independently exported by multiple people,
combined with occasional branch deletion, is a risk.

...

Thanks

Michal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-09  0:49       ` Michal Suchánek
@ 2022-12-09  1:37         ` Jeff King
  2022-12-09  7:26           ` ZheNing Hu
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2022-12-09  1:37 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Ævar Arnfjörð Bjarmason, ZheNing Hu, Git List,
	Junio C Hamano, Christian Couder, johncai86, Taylor Blau

On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote:

> > In this case it's the mtime on the object file (or the pack containing
> > it). But yes, it is far from a complete race-free solution.
> 
> So if you are pushing a branch that happens to reuse commits or other
> > objects from an earlier branch that might have been collected in the
> meantime you are basically doomed.

Basically yes. We do "freshen" the mtimes on object files when we omit
an object write (e.g., your index state ends up at the same tree as an
old one). But for a push, there is no freshening. We check the graph at
the time of the push and decide if we have everything we need (either
newly pushed, or from what we already had in the repo). And that is
what's racy; somebody might be deleting as that check is happening.
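A toy model of that freshening behaviour (nothing here is git's actual code;
it just shows the mtime bump that resets the prune clock for an object we
declined to rewrite):

```shell
# Hypothetical sketch of "freshening": when git would skip writing an
# object it already has, it bumps the existing file's mtime instead,
# which pushes the object back outside the gc.pruneExpire window.
freshen_or_write() {
    obj=$1   # path where the loose object would live
    if [ -f "$obj" ]; then
        touch "$obj"   # already present: just refresh the mtime
    else
        : > "$obj"     # otherwise write the new object (stubbed here)
    fi
}
```

As Peff notes, pushes skip this step, which is exactly why they are exposed to
the race.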

> People deleting a branch and then pushing another variant in which many
> objects are the same is a risk.
> 
> People exporting files from somewhere and adding them to the repo which
> are bit-identical when independently exported by multiple people and
> sometimes deleting branches is a risk.

Yes, both of those are risky (along with many other variants).

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-08  1:16   ` Michal Suchánek
  2022-12-08  7:01     ` Jeff King
@ 2022-12-09  7:15     ` ZheNing Hu
  1 sibling, 0 replies; 14+ messages in thread
From: ZheNing Hu @ 2022-12-09  7:15 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Ævar Arnfjörð Bjarmason, Git List, Junio C Hamano,
	Christian Couder, johncai86, Taylor Blau

Michal Suchánek <msuchanek@suse.de> wrote on Thu, Dec 8, 2022 at 09:16:
>
> On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
> >
> > On Wed, Dec 07 2022, ZheNing Hu wrote:
> >
> > > I would like to run git gc on my git server periodically, which should help
> > > reduce storage space and optimize the read performance of the repository.
> > > I know github, gitlab all have this process...
> > >
> > > But the concurrency between git gc and other git commands is holding
> > > me back a bit.
> > >
> > > git-gc [1] docs say:
> > >
> > >     On the other hand, when git gc runs concurrently with another process,
> > >     there is a risk of it deleting an object that the other process is using but
> > >     hasn’t created a reference to. This may just cause the other process to
> > >     fail or may corrupt the repository if the other process later adds
> > > a reference
> > >     to the deleted object.
> > >
> > > It seems that git gc is a dangerous operation that may cause data corruption
> > > concurrently with other git commands.
> > >
> > > Then I read the contents of GitHub's blog [2]; git gc --cruft seems to be used
> > > to keep those expiring unreachable objects in a cruft pack, but the blog says
> > > GitHub uses some special "limbo" repository to keep the cruft pack for git data
> > > recovery. Well, a lot of the details here are pretty hard for me to understand :(
> > >
> > > However, on the other hand, my git server is still at v2.35, and --cruft was
> > > introduced in v2.38, so I'm actually more curious about: how did the server
> > > execute git gc correctly in the past? Do we need a repository level "big lock"
> > > that blocks most/all other git operations? What should the behavior of users'
> > > git clone/push be at this time? Report error that the git server is performing
> > > git gc? Or just wait for git gc to complete?
> > >
> > > Thanks for any comments and help!
> > >
> > > [1]: https://git-scm.com/docs/git-gc
> > > [2]: https://github.blog/2022-09-13-scaling-gits-garbage-collection/
> >
> > Is this for a very large hosting site that's anywhere near GitHub,
> > GitLab's etc. scale?
> >
> > A "git gc" on a "live" repo is always racy in theory, but the odds that
> > you'll run into data corrupting trouble tends to approach zero as you
> > increase the gc.pruneExpire setting, with the default 2 weeks being more
> > than enough for even the most paranoid user.
>
> And that two weeks expiration applies to what, exactly?
>
> For commits there is author date and commit date but many other objects
> won't have these I suppose. And the date when the object is pushed into
> the repository is unrelated to these two, anyway.
>
> > So, I think you probably don't need to worry about it. Other major
> > hosting sites do run "git gc" on live repositories, but as always take
> > backups etc.
>
> Actually, it is a real problem. With <100 users and some scripting I got
> unexplained repository corruptions which went away when gc was disabled.
>
> YMMV
>
> Bad locking design is always a landmine waiting to get triggered. If you
> step carefully you might avoid it for some time.
>

I agree with this. What I'm really hoping for is "no errors at all"
rather than "a small probability of error".

> Thanks
>
> Michal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-09  1:37         ` Jeff King
@ 2022-12-09  7:26           ` ZheNing Hu
  2022-12-09 13:48             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 14+ messages in thread
From: ZheNing Hu @ 2022-12-09  7:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Michal Suchánek, Ævar Arnfjörð Bjarmason,
	Git List, Junio C Hamano, Christian Couder, johncai86,
	Taylor Blau

Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37:
>
> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote:
>
> > > In this case it's the mtime on the object file (or the pack containing
> > > it). But yes, it is far from a complete race-free solution.
> >
> > So if you are pushing a branch that happens to reuse commits or other
> > objects from an earlier branch that might have been collected in the
> > meantime you are basically doomed.
>
> Basically yes. We do "freshen" the mtimes on object files when we omit
> an object write (e.g., your index state ends up at the same tree as an
> old one). But for a push, there is no freshening. We check the graph at
> the time of the push and decide if we have everything we need (either
> newly pushed, or from what we already had in the repo). And that is
> what's racy; somebody might be deleting as that check is happening.
>
> > People deleting a branch and then pushing another variant in which many
> > objects are the same is a risk.
> >
> > People exporting files from somewhere and adding them to the repo which
> > are bit-identical when independently exported by multiple people and
> > sometimes deleting branches is a risk.
>
> Yes, both of those are risky (along with many other variants).
>

I'm wondering if there's an easy, even if poorly performing, way to do
gc safely? For example, adding a file lock to the repository during
git push and git gc?

> -Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-09  7:26           ` ZheNing Hu
@ 2022-12-09 13:48             ` Ævar Arnfjörð Bjarmason
  2022-12-11 16:01               ` ZheNing Hu
  0 siblings, 1 reply; 14+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-09 13:48 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Jeff King, Michal Suchánek, Git List, Junio C Hamano,
	Christian Couder, johncai86, Taylor Blau


On Fri, Dec 09 2022, ZheNing Hu wrote:

> Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37:
>>
>> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote:
>>
>> > > In this case it's the mtime on the object file (or the pack containing
>> > > it). But yes, it is far from a complete race-free solution.
>> >
>> > So if you are pushing a branch that happens to reuse commits or other
>> > objects from an earlier branch that might have been collected in the
>> > meantime you are basically doomed.
>>
>> Basically yes. We do "freshen" the mtimes on object files when we omit
>> an object write (e.g., your index state ends up at the same tree as an
>> old one). But for a push, there is no freshening. We check the graph at
>> the time of the push and decide if we have everything we need (either
>> newly pushed, or from what we already had in the repo). And that is
>> what's racy; somebody might be deleting as that check is happening.
>>
>> > People deleting a branch and then pushing another variant in which many
>> > objects are the same is a risk.
>> >
>> > People exporting files from somewhere and adding them to the repo which
>> > are bit-identical when independently exported by multiple people and
>> > sometimes deleting branches is a risk.
>>
>> Yes, both of those are risky (along with many other variants).
>>
>
> I'm wondering if there's an easy, even if poorly performing, way to do
> gc safely? For example, adding a file lock to the repository during
> git push and git gc?

We don't have any "easy" way to do it, but we probably should. The root
cause of the race is tricky to fix, and we don't have any "global ref
lock".

But in the context of a client<->server and wanting to do gc on the
server a good enough and easy solution would be e.g.:

 1. Have a {pre,post}-receive hook logging attempted/finished pushes
 2. Have the pre-receive hook able to reject (or better yet, hang with
    sleep()) incoming deletions
 3. Do a gc with a small wrapper script, which:
    - Flips the "no deletion ops now" (or "delay deletion ops") switch
    - Polls until it's sure there's no relevant in-progress operations
    - Do a full gc
    - Unlock

You'd need to be certain that all relevant repo operations are going
through git-receive-pack etc., i.e. a local "git branch -d" or the like
won't run {pre,post}-receive.
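Step 2 of that sketch might look roughly like this as a pre-receive hook body;
the "gc in progress" flag-file convention is invented here, created by the gc
wrapper before it runs and removed afterwards:

```shell
# Hypothetical pre-receive hook logic: refuse ref deletions while a
# "gc in progress" flag file exists. git feeds the hook "old new ref"
# lines on stdin; a deletion is a push to the all-zeros object id.
deny_deletions_during_gc() {
    flag=$1   # e.g. "$GIT_DIR/gc-in-progress"
    zero=0000000000000000000000000000000000000000
    while read -r old new ref; do
        if [ "$new" = "$zero" ] && [ -f "$flag" ]; then
            echo "refusing deletion of $ref: maintenance in progress" >&2
            return 1
        fi
    done
    return 0
}
```

Rejecting with a clear message (or sleeping briefly and retrying) is a policy
choice; the point is only that deletions and expiry never overlap.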

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-09 13:48             ` Ævar Arnfjörð Bjarmason
@ 2022-12-11 16:01               ` ZheNing Hu
  2022-12-11 16:27                 ` Michal Suchánek
  0 siblings, 1 reply; 14+ messages in thread
From: ZheNing Hu @ 2022-12-11 16:01 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Michal Suchánek, Git List, Junio C Hamano,
	Christian Couder, johncai86, Taylor Blau

Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on Fri, Dec 9, 2022 at 21:52:

>
>
> On Fri, Dec 09 2022, ZheNing Hu wrote:
>
> > Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37:
> >>
> >> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote:
> >>
> >> > > In this case it's the mtime on the object file (or the pack containing
> >> > > it). But yes, it is far from a complete race-free solution.
> >> >
> >> > So if you are pushing a branch that happens to reuse commits or other
> >> > objects from an earlier branch that might have been collected in the
> >> > meantime you are basically doomed.
> >>
> >> Basically yes. We do "freshen" the mtimes on object files when we omit
> >> an object write (e.g., your index state ends up at the same tree as an
> >> old one). But for a push, there is no freshening. We check the graph at
> >> the time of the push and decide if we have everything we need (either
> >> newly pushed, or from what we already had in the repo). And that is
> >> what's racy; somebody might be deleting as that check is happening.
> >>
> >> > People deleting a branch and then pushing another variant in which many
> >> > objects are the same is a risk.
> >> >
> >> > People exporting files from somewhere and adding them to the repo which
> >> > are bit-identical when independently exported by multiple people and
> >> > sometimes deleting branches is a risk.
> >>
> >> Yes, both of those are risky (along with many other variants).
> >>
> >
> > I'm wondering if there's an easy, even if poorly performing, way to do
> > gc safely? For example, adding a file lock to the repository during
> > git push and git gc?
>
> We don't have any "easy" way to do it, but we probably should. The root
> cause of the race is tricky to fix, and we don't have any "global ref
> lock".
>
> But in the context of a client<->server and wanting to do gc on the
> server a good enough and easy solution would be e.g.:
>
>  1. Have a {pre,post}-receive hook logging attempted/finished pushes
>  2. Have the pre-receive hook able to reject (or better yet, hang with
>     sleep()) incoming deletions
>  3. Do a gc with a small wrapper script, which:
>     - Flips the "no deletion ops now" (or "delay deletion ops") switch
>     - Polls until it's sure there's no relevant in-progress operations
>     - Do a full gc
>     - Unlock
>
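
The wrapper in step 3 could be sketched like this; the `gc-in-progress` lock
name and the "look for in-flight temporary pack files" polling condition are
assumptions, and the demo runs against a throwaway bare repo rather than a
real served one:

```shell
#!/bin/sh
set -e

# Demo repo; on a real server REPO would be the served repository.
REPO=$(mktemp -d)/repo.git
git init --quiet --bare "$REPO"

LOCK="$REPO/gc-in-progress"   # hypothetical "no deletion ops now" switch

# 1. Flip the switch: the pre-receive hook starts rejecting (or delaying)
#    incoming ref deletions while this file exists.
touch "$LOCK"

# 2. Poll until no push is mid-flight (here we look for receive-pack's
#    temporary object files; in this fresh repo the loop exits at once).
while find "$REPO/objects" -name 'tmp_*' 2>/dev/null | grep -q .; do
    sleep 1
done

# 3. Full gc while deletions are held off.
git -C "$REPO" gc --quiet --prune=now

# 4. Unlock.
rm -f "$LOCK"
echo "gc finished, lock released"
```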

Well, I understand that after a branch is deleted, some objects may become
unreachable, and a concurrent git gc may then delete them, which corrupts a
concurrent git push that still needs to reference those objects. What I
don't understand is: is it enough to just block branch deletions? If gc
happens to be deleting an unreachable object at the moment a push of a new
branch (git receive-pack) needs it, won't things still go wrong?

> You'd need to be certain that all relevant repo operations are going
> through git-receive-pack etc., i.e. a local "git branch -d" or the like
> won't run {pre,post}-receive.
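
For step 2, the deletion check itself is tiny. A sketch as a shell function
(pre-receive reads `<old> <new> <ref>` lines on stdin, and an all-zero new
value means the ref is being deleted; the `gc-in-progress` lock name is an
assumption):

```shell
#!/bin/sh

# Refuse ref deletions while the (hypothetical) gc lock file exists.
refuse_deletions() {
    lock="${GIT_DIR:-.}/gc-in-progress"
    zero=0000000000000000000000000000000000000000
    while read -r old new ref; do
        if [ "$new" = "$zero" ] && [ -e "$lock" ]; then
            echo "refusing deletion of $ref: gc in progress" >&2
            return 1
        fi
    done
    return 0
}

# Demo: with the lock held, an incoming deletion is refused.
GIT_DIR=$(mktemp -d)
export GIT_DIR
touch "$GIT_DIR/gc-in-progress"
if printf '%s %s %s\n' aaaa 0000000000000000000000000000000000000000 refs/heads/topic \
        | refuse_deletions 2>/dev/null; then
    echo "deletion allowed"
else
    echo "deletion refused"
fi
```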

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Question: How to execute git-gc correctly on the git server
  2022-12-11 16:01               ` ZheNing Hu
@ 2022-12-11 16:27                 ` Michal Suchánek
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Suchánek @ 2022-12-11 16:27 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Ævar Arnfjörð Bjarmason, Jeff King, Git List,
	Junio C Hamano, Christian Couder, johncai86, Taylor Blau

On Mon, Dec 12, 2022 at 12:01:51AM +0800, ZheNing Hu wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote on Fri, Dec 9, 2022 at 21:52:
>
> >
> >
> > On Fri, Dec 09 2022, ZheNing Hu wrote:
> >
> > > Jeff King <peff@peff.net> wrote on Fri, Dec 9, 2022 at 09:37:
> > >>
> > >> On Fri, Dec 09, 2022 at 01:49:18AM +0100, Michal Suchánek wrote:
> > >>
> > >> > > In this case it's the mtime on the object file (or the pack containing
> > >> > > it). But yes, it is far from a complete race-free solution.
> > >> >
> > >> > So if you are pushing a branch that happens to reuse commits or other
> > >> > objects from an earlier branch that might have been collected in the
> > >> > meantime you are basically doomed.
> > >>
> > >> Basically yes. We do "freshen" the mtimes on object files when we omit
> > >> an object write (e.g., your index state ends up at the same tree as an
> > >> old one). But for a push, there is no freshening. We check the graph at
> > >> the time of the push and decide if we have everything we need (either
> > >> newly pushed, or from what we already had in the repo). And that is
> > >> what's racy; somebody might be deleting as that check is happening.
> > >>
> > >> > People deleting a branch and then pushing another variant in which many
> > >> > objects are the same is a risk.
> > >> >
> > >> > People exporting files from somewhere and adding them to the repo which
> > >> > are bit-identical when independently exported by multiple people and
> > >> > sometimes deleting branches is a risk.
> > >>
> > >> Yes, both of those are risky (along with many other variants).
> > >>
> > >
> > > I'm wondering if there's an easy and poor performance way to do
> > > gc safely? For example, add a file lock to the repository during
> > > git push and git gc?
> >
> > We don't have any "easy" way to do it, but we probably should. The root
> > cause of the race is tricky to fix, and we don't have any "global ref
> > lock".
> >
> > But in the context of a client<->server and wanting to do gc on the
> > server a good enough and easy solution would be e.g.:
> >
> >  1. Have a {pre,post}-receive hook logging attempted/finished pushes
> >  2. Have the pre-receive hook able to reject (or better yet, hang with
> >     sleep()) incoming deletions
> >  3. Do a gc with a small wrapper script, which:
> >     - Flips the "no deletion ops now" (or "delay deletion ops") switch
> >     - Polls until it's sure there's no relevant in-progress operations
> >     - Do a full gc
> >     - Unlock
> >
>
> Well, I understand that after a branch is deleted, some objects may become
> unreachable, and a concurrent git gc may then delete them, which corrupts a
> concurrent git push that still needs to reference those objects. What I
> don't understand is: is it enough to just block branch deletions? If gc
> happens to be deleting an unreachable object at the moment a push of a new
> branch (git receive-pack) needs it, won't things still go wrong?

As I understand the problem:

 - push figures out which objects are missing on the remote end
 - push starts sending the missing objects
 - remote gc deletes objects that are not reachable, but which push assumes
   the remote has
   - these might be part of a branch deleted long before gc started
 - push finishes and the branch is advanced to point to an object that
   references objects that were deleted by gc

-> repository is corrupted

The only way to prevent this is to not delete anything ever, or to make
sure that objects that are part of any ongoing operation are always
referenced.

Which would probably mean in practice that any operation adding objects
needs to add temporary references to any objects it creates or aims to
reference, and/or check reachability of referenced objects once the
final object is created.
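
The "temporary references" idea can be demonstrated with stock git commands;
the `refs/in-progress/` namespace and the ref name below are assumptions
(git does not create such pins for you):

```shell
#!/bin/sh
set -e

REPO=$(mktemp -d)/repo.git
git init --quiet --bare "$REPO"

# An incoming object that no branch points at yet, i.e. unreachable:
blob=$(echo payload | git -C "$REPO" hash-object -w --stdin)

# Pin it with a temporary ref before any reachability-based pruning runs:
git -C "$REPO" update-ref refs/in-progress/push-1234 "$blob"

# An aggressive prune cannot touch it now, because it is reachable:
git -C "$REPO" prune --expire=now
git -C "$REPO" cat-file -e "$blob" && echo "pinned object survived prune"

# Once the real ref update has landed, drop the pin:
git -C "$REPO" update-ref -d refs/in-progress/push-1234
```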

Thanks

Michal


* Re: Question: How to execute git-gc correctly on the git server
  2022-12-08 12:35     ` Ævar Arnfjörð Bjarmason
@ 2022-12-14 20:11       ` Taylor Blau
  0 siblings, 0 replies; 14+ messages in thread
From: Taylor Blau @ 2022-12-14 20:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, ZheNing Hu, Git List, Junio C Hamano, Christian Couder,
	johncai86

On Thu, Dec 08, 2022 at 01:35:04PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >> The "cruft pack" facility does many different things, and my
> >> understanding of it is that GitHub's not using it only as an end-run
> >> around potential corruption issues, but that some not yet in tree
> >> patches on top of it allow more aggressive "gc" without the fear of
> >> corruption.
> >
> > I don't think cruft packs themselves help against corruption that much.
> > For many years, GitHub used "repack -k" to just never expire objects.
> > What cruft packs help with is:
> >
> >   1. They keep cruft objects out of the main pack, which reduces the
> >      costs of lookups and bitmaps for the main pack.

Peff isn't wrong here, but there is a big caveat which is that this is
only true when using a single pack bitmap. Single pack bitmaps are
guaranteed to have reachability closure over their objects, but writing
a MIDX bitmap after generating the MIDX does not afford us the same
guarantees.

So if you have a cruft pack which contains some unreachable object X,
which is made reachable by some other object that *is* reachable from
some reference, *and that* object is included in one of the MIDX's
packs, then we won't have reachability closure unless we also bitmap the
cruft pack, too.

So even though it helps a lot with bitmapping in the single-pack case,
in practice it doesn't make a significant difference with multi-pack
bitmaps.

> >   2. When you _do_ choose to expire, you can do so without worrying
> >      about accidentally exploding all of those old objects into loose
> >      ones (which is not wrong from a correctness point of view, but can
> >      have some amazingly bad performance characteristics).
> >
> > I think the bits you're thinking of on top are in v2.39. The "repack
> > --expire-to" option lets you write objects that _would_ be deleted into
> > a cruft pack, which can serve as a backup (but managing that is out of
> > scope for repack itself, so you have to roll your own strategy there).
>
> Yes, that's what I was referring to.

Yes, we use the `--expire-to` option when doing a pruning GC to move the
expired objects out of the repo to some "../backup.git" location. The
out-of-tree tooling Ævar is speculating about basically runs
`cat-file --batch` in the backup repo, feeding it the list of missing
objects, and then writing those objects (back) into the GC'd repository.
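
A per-object sketch of that recovery flow (the real tooling batches this
with `cat-file --batch`; the repo names are hypothetical and the demo builds
its own throwaway repos):

```shell
#!/bin/sh
set -e

tmp=$(mktemp -d)
git init --quiet --bare "$tmp/backup.git"   # where the expired objects went
git init --quiet --bare "$tmp/live.git"     # the GC'd repo missing an object

# An object that now exists only in the backup:
sha=$(echo payload | git -C "$tmp/backup.git" hash-object -w --stdin)

# Copy it back into the GC'd repository, preserving its type:
type=$(git -C "$tmp/backup.git" cat-file -t "$sha")
git -C "$tmp/backup.git" cat-file "$type" "$sha" |
    git -C "$tmp/live.git" hash-object -w -t "$type" --stdin >/dev/null

git -C "$tmp/live.git" cat-file -e "$sha" && echo "recovered $sha"
```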

> I think I had feedback on that series saying that if held correctly this
> would also nicely solve that long-time race. Maybe I'm just
> misremembering, but I (mis?)recalled that Taylor indicated that it was
> being used like that at GitHub.

It (the above) doesn't solve the race, but it does make it easier to
recover from a corrupt repository when we lose that race.

Thanks,
Taylor


end of thread, other threads:[~2022-12-14 20:25 UTC | newest]

Thread overview: 14+ messages
2022-12-07 15:58 Question: How to execute git-gc correctly on the git server ZheNing Hu
2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
2022-12-08  1:16   ` Michal Suchánek
2022-12-08  7:01     ` Jeff King
2022-12-09  0:49       ` Michal Suchánek
2022-12-09  1:37         ` Jeff King
2022-12-09  7:26           ` ZheNing Hu
2022-12-09 13:48             ` Ævar Arnfjörð Bjarmason
2022-12-11 16:01               ` ZheNing Hu
2022-12-11 16:27                 ` Michal Suchánek
2022-12-09  7:15     ` ZheNing Hu
2022-12-08  6:59   ` Jeff King
2022-12-08 12:35     ` Ævar Arnfjörð Bjarmason
2022-12-14 20:11       ` Taylor Blau

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
