git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: ZheNing Hu <adlternative@gmail.com>,
	Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Christian Couder <christian.couder@gmail.com>,
	johncai86@gmail.com, Taylor Blau <me@ttaylorr.com>
Subject: Re: Question: How to execute git-gc correctly on the git server
Date: Thu, 08 Dec 2022 13:35:04 +0100	[thread overview]
Message-ID: <221208.86o7se6ou1.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <Y5GLWe4Kdaz+T5P8@coredump.intra.peff.net>


On Thu, Dec 08 2022, Jeff King wrote:

> On Thu, Dec 08, 2022 at 12:57:45AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Is this for a very large hosting site that's anywhere near GitHub,
>> GitLab's etc. scale?
>> 
>> A "git gc" on a "live" repo is always racy in theory, but the odds that
>> you'll run into data corrupting trouble tends to approach zero as you
>> increase the gc.pruneExpire setting, with the default 2 weeks being more
>> than enough for even the most paranoid user.
>
> I'm a bit less optimistic than "tends to approach zero" here. The
> objects themselves might be older, but they may become referenced or
> unreferenced in an immediate racy way. E.g., intending to move branch Z
> to branch A, a client asks to add A and remove Z. Since there is no
> atomic view of the ref namespace, a simultaneous gc might see Z gone,
> but A not yet existing (and of course it could also be _two_ clients,
> etc).

Yes, I'm really hand-waiving away the issue there for brevity.

You've got a really good summary of why exactly that race happens
somewhere in the list archive, which I'm not digging up now.

But (and just for the general audience here, I know you know this) that
race basically happens because "gc" concurrently decides that say a 2
week old blob containing "foo" isn't referenced, *and* we have a
concurrent branch push that happens to reference a "foo" blob, along
with a concurrent "gc".

I have run into this once or twice in practice (with a very high volume
in-house git server), and in those cases it was because person "B" doing
the new push was using person's "A"'s work to push a new branch, after it
had been ~gc.pruneExpire since the topic branch for "A" was
simultaneously being deleted.

In principle what I noted upthread is completely false, but I think in
practice it's almost always true. I.e. users aren't pushing random
blobs, and as time goes by the odds that the exact same content is
re-pushed go down.

It's also worth noting that some repositories are much more vulnerable
to this than others.

If you have predictable auto-generated content in your repo (think the
package+version list some languages routinely carry in-tree) you're much
more likely to run into this: Someone bumped that for a topic ~2 weeks
ago, nobody else bothered on any of their branches, and then the "A"+"B"
race described above happens.

As some practical advice to those running "gc" on live repos: To easily
mitigate this run expiry on the least busy hours of the day. Even for
truly global development teams there's usually a big lull when it's high
noon in the middle of the Pacific.

>> The "cruft pack" facility does many different things, and my
>> understanding of it is that GitHub's not using it only as an end-run
>> around potential corruption issues, but that some not yet in tree
>> patches on top of it allow more aggressive "gc" without the fear of
>> corruption.
>
> I don't think cruft packs themselves help against corruption that much.
> For many years, GitHub used "repack -k" to just never expire objects.
> What cruft packs help with is:
>
>   1. They keep cruft objects out of the main pack, which reduces the
>      costs of lookups and bitmaps for the main pack.
>
>   2. When you _do_ choose to expire, you can do so without worrying
>      about accidentally exploding all of those old objects into loose
>      ones (which is not wrong from a correctness point of view, but can
>      have some amazingly bad performance characteristics).
>
> I think the bits you're thinking of on top are in v2.39. The "repack
> --expire-to" option lets you write objects that _would_ be deleted into
> a cruft pack, which can serve as a backup (but managing that is out of
> scope for repack itself, so you have to roll your own strategy there).

Yes, that's what I was referring to.

I think I had feedback on that series saying that if held correctly this
would also nicely solve that long-time race. Maybe I'm just
misremembering, but I (mis?)recalled that Taylor indicated that it was
being used like that at GitHub.

Another thing that really helps to mitigate it is this never-in-tree
patch of mine (which ran in busy production for years, and probably
still):
https://lore.kernel.org/git/20181028225023.26427-1-avarab@gmail.com/

It's sensitive to "transfer.unpackLimit" if it's going to help with
that, and even if you always write duplicate content it won't fully help
with the race, as "B" may have seen the old ref, and hence not sent the
"foo" blob over (so the client would need to not have a copy of "A"'s
about-to-be-deleted topic).

All of which described a setup I ran it in, so *if* we ever ran into the
race then most likely we'd just have written duplicate content in the
incoming PACK, so we'd happily chug along.


  reply	other threads:[~2022-12-08 12:53 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-07 15:58 Question: How to execute git-gc correctly on the git server ZheNing Hu
2022-12-07 23:57 ` Ævar Arnfjörð Bjarmason
2022-12-08  1:16   ` Michal Suchánek
2022-12-08  7:01     ` Jeff King
2022-12-09  0:49       ` Michal Suchánek
2022-12-09  1:37         ` Jeff King
2022-12-09  7:26           ` ZheNing Hu
2022-12-09 13:48             ` Ævar Arnfjörð Bjarmason
2022-12-11 16:01               ` ZheNing Hu
2022-12-11 16:27                 ` Michal Suchánek
2022-12-09  7:15     ` ZheNing Hu
2022-12-08  6:59   ` Jeff King
2022-12-08 12:35     ` Ævar Arnfjörð Bjarmason [this message]
2022-12-14 20:11       ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=221208.86o7se6ou1.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=adlternative@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=johncai86@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).