git@vger.kernel.org mailing list mirror (one of many)
* Avoid race condition between fetch and repack/gc?
@ 2020-03-16  8:23 Andreas Krey
  2020-03-16 12:10 ` Derrick Stolee
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Krey @ 2020-03-16  8:23 UTC (permalink / raw)
  To: git

Hi all,

we are occasionally seeing things like this:

| DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune
| Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts.
| remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed        
| error: git upload-pack: git-pack-objects died with error.
| fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
| remote: aborting due to possible repository corruption on the remote side.
| fatal: protocol error: bad pack header

and when you look in the server repository there is a new packfile dated just around
that time. It looks like the fetch tries to access a packfile that it assumes to exist,
but the GC on the server throws it away just in that moment, and thus upload-pack fails.

Is there a way to avoid this?

Should there be something like git repack waiting a bit before deleting old packfiles?

- Andreas

-- 
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800

* Re: Avoid race condition between fetch and repack/gc?
  2020-03-16  8:23 Avoid race condition between fetch and repack/gc? Andreas Krey
@ 2020-03-16 12:10 ` Derrick Stolee
  2020-03-16 17:17   ` Nasser Grainawi
  2020-03-16 17:27   ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Derrick Stolee @ 2020-03-16 12:10 UTC (permalink / raw)
  To: Andreas Krey, git, peff@peff.net

On 3/16/2020 4:23 AM, Andreas Krey wrote:
> Hi all,
> 
> we are occasionally seeing things like this:
> 
> | DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune

I'm happy to see these options. I hope they are helping you!

> | Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts.
> | remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed        
This _could_ mean a lot of things, but....

> | error: git upload-pack: git-pack-objects died with error.
> | fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
> | remote: aborting due to possible repository corruption on the remote side.
> | fatal: protocol error: bad pack header
> 
> and when you look in the server repository there is a new packfile dated just around
> that time. It looks like the fetch tries to access a packfile that it assumes to exist,
> but the GC on the server throws it away just in that moment, and thus upload-pack fails.

...your intuition about repacking seems accurate. The important part of the
race condition is likely that the server process reads and holds a read handle
on the .idx file, but when looking for the object contents it tries to open
the .pack file which was deleted.

This error is emitted by use_pack() in packfile.c. I'm surprised there is no
fallback here, and we simply die().

The race condition seems to be related to the loop in do_oid_object_info_extended()
in sha1-file.c looping through packs until finding the object in question: it does
not verify that the .pack file is open with a valid handle before terminating the
loop and calling packed_object_info().

Perhaps the fix is to update do_oid_object_info_extended() to "accept" a pack as
the home of the object only after verifying the pack is either open or can be
opened. That seems like the least-invasive fix to me.

The more-invasive fix is to modify the stack from packed_object_info() to
use_pack() to use error messages and return codes instead of die(). This would
still need to affect the loop in do_oid_object_info_extended(), but may be a
better way to handle this situation in general.

Of course, this is a very critical code path, and maybe other community members
have more context as to why we are not already doing this?

> Is there a way to avoid this?
> 
> Should there be something like git repack waiting a bit before deleting old packfiles?

This all depends on how you are managing your server. It is likely that you
could create your own maintenance routine that handles this for you.

The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
of issue, but is not yet integrated well with reachability bitmaps. You likely
require the bitmaps to keep your server performance, so that may not be a way
forward for you.
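
For illustration, such a cycle might look roughly like the sketch below.
This is untested and only uses the documented multi-pack-index
subcommands; the batch size and the core.multiPackIndex note are
assumptions you would verify against your git version:

  # Rough maintenance sketch; run inside the bare repository on the server.
  git config core.multiPackIndex true            # older versions need this to read the midx
  git multi-pack-index write                     # cover the current packs with a midx
  git multi-pack-index expire                    # delete packs with no objects referenced by the midx
  git multi-pack-index repack --batch-size=2g    # combine small packs into one new pack
  # Packs made redundant by the repack step are only removed by the next
  # cycle's "expire", so old packfiles linger for a full cycle instead of
  # vanishing under a concurrent fetch.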

Thanks,
-Stolee

* Re: Avoid race condition between fetch and repack/gc?
  2020-03-16 12:10 ` Derrick Stolee
@ 2020-03-16 17:17   ` Nasser Grainawi
  2020-03-16 17:27   ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Nasser Grainawi @ 2020-03-16 17:17 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Andreas Krey, git, peff@peff.net


> On Mar 16, 2020, at 6:10 AM, Derrick Stolee <stolee@gmail.com> wrote:
> 
> On 3/16/2020 4:23 AM, Andreas Krey wrote:
>> Hi all,
>> 
>> we are occasionally seeing things like this:
>> 
>> | DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune
> 
> I'm happy to see these options. I hope they are helping you!
> 
>> | Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts.
>> | remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed        
> This _could_ mean a lot of things, but....
> 
>> | error: git upload-pack: git-pack-objects died with error.
>> | fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>> | remote: aborting due to possible repository corruption on the remote side.
>> | fatal: protocol error: bad pack header
>> 
>> and when you look in the server repository there is a new packfile dated just around
>> that time. It looks like the fetch tries to access a packfile that it assumes to exist,
>> but the GC on the server throws it away just in that moment, and thus upload-pack fails.
> 
> ...your intuition about repacking seems accurate. The important part of the
> race condition is likely that the server process reads and holds a read handle
> on the .idx file, but when looking for the object contents it tries to open
> the .pack file which was deleted.
> 

[snip]

> 
>> Is there a way to avoid this?
>> 
>> Should there be something like git repack waiting a bit before deleting old packfiles?
> 
> This all depends on how you are managing your server. It is likely that you
> could create your own maintenance that handles this for you.
> 
> The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
> of issue, but is not yet integrated well with reachability bitmaps. You likely
> require the bitmaps to keep your server performance, so that may not be a way
> forward for you.

We manage this on our servers with a repack wrapper that first creates hard
links for all packfiles into an objects/pack/preserved dir, and then we have
patches on top of JGit [1] that actually know how to recover objects from
that dir when the original pack is removed by repacking. It’s worked quite
well for us for a couple of years now and should be compatible with/without
bitmaps (haven’t specifically tested) and any pack/repacking strategy.

[1] https://git.eclipse.org/r/122288
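
To make the hard-link idea concrete, here is a minimal sketch. The paths,
the grace period, and the wrapper shape are made up for illustration, and
actually recovering objects from the preserved copies relies on the JGit
patches above; stock git will not look in that directory:

  # Illustrative sketch only -- not the actual JGit/Gerrit implementation.
  repo=/srv/git/example.git
  mkdir -p "$repo/objects/pack/preserved"
  for f in "$repo"/objects/pack/pack-*.pack "$repo"/objects/pack/pack-*.idx; do
      [ -e "$f" ] && ln -f "$f" "$repo/objects/pack/preserved/${f##*/}"
  done
  git -C "$repo" repack -a -d    # may delete originals; hard links keep the data alive
  # After a grace period, drop the preserved copies:
  find "$repo/objects/pack/preserved" -type f -mtime +1 -delete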

* Re: Avoid race condition between fetch and repack/gc?
  2020-03-16 12:10 ` Derrick Stolee
  2020-03-16 17:17   ` Nasser Grainawi
@ 2020-03-16 17:27   ` Jeff King
  2020-03-16 23:40     ` Bryan Turner
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2020-03-16 17:27 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Andreas Krey, git

On Mon, Mar 16, 2020 at 08:10:56AM -0400, Derrick Stolee wrote:

> > | error: git upload-pack: git-pack-objects died with error.
> > | fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
> > | remote: aborting due to possible repository corruption on the remote side.
> > | fatal: protocol error: bad pack header
> > 
> > and when you look in the server repository there is a new packfile dated just around
> > that time. It looks like the fetch tries to access a packfile that it assumes to exist,
> > but the GC on the server throws it away just in that moment, and thus upload-pack fails.
> 
> ...your intuition about repacking seems accurate. The important part of the
> race condition is likely that the server process reads and holds a read handle
> on the .idx file, but when looking for the object contents it tries to open
> the .pack file which was deleted.
>
> This error is emitted by use_pack() in packfile.c. I'm surprised there is no
> fallback here, and we simply die().

Yes, you're correct this is what's going on, but there's some more
subtlety (see below).

The issue is that by the time we get to use_pack(), it's generally too
late; we've committed to use this particular packed representation of
the object, and it would be very hard to back out of it.

There are really two cases worth considering independently: normal
access of objects via read_object_file(), oid_object_info(), etc., versus
pack-objects.

In the normal code paths, contrary to what you wrote here:

> The race condition seems to be related to the loop in do_oid_object_info_extended()
> in sha1-file.c looping through packs until finding the object in question: it does
> not verify that the .pack file is open with a valid handle before terminating the
> loop and calling packed_object_info().
>
> Perhaps the fix is to update do_oid_object_info_extended() to "accept" a pack as
> the home of the object only after verifying the pack is either open or can be
> opened. That seems like the least-invasive fix to me.

we do make sure we have an open handle to the pack. This happens in
fill_pack_entry(), which does:

          /*
           * We are about to tell the caller where they can locate the
           * requested object.  We better make sure the packfile is
           * still here and can be accessed before supplying that
           * answer, as it may have been deleted since the index was
           * loaded!
           */
          if (!is_pack_valid(p))
                  return 0;

And is_pack_valid() not only checks the pack, but leaves open either the
descriptor or our mmap'd handle.

So in do_oid_object_info_extended(), we call fill_pack_entry() and we
won't consider the object available unless we have a handle open to the
pack. And since nobody else is manipulating the handles in between
(because any multi-threaded locking is much coarser than this), it
should be race-proof.

But pack-objects is not so lucky. Because it wants to look at intimate
details of the packed objects (e.g., reusing on-disk deltas or
zlib-deflated contents), it stores away the packfile/offset pair for
objects during the "Counting" phase and then uses it much later in the
"Writing" phase.

So it's possible for other lookups in the meantime to cause us to drop
our descriptor or mmap handle, and then later we find we can't re-open
it. But in practice, this shouldn't happen as long as:

  - you have a reasonable number of packs compared to your file
    descriptor limit

  - you don't use config like core.packedGitLimit that encourages Git to
    drop mmap'd windows. On 64-bit systems, this doesn't help at all (we
    have plenty of address space, and since the maps are read-only, the
    OS is free to drop the clean pages if there's memory pressure). On
    32-bit systems, it's a necessity (and I'd expect a 32-bit server to
    run into this issue a lot for large repositories). A quick way to
    check whether such config is in effect is sketched below.
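
For example (the config keys are real; the commands are just one way to
inspect and drop them, shown here as a sketch):

  # Where, if anywhere, is a pack limit configured?
  git config --show-origin --get-all core.packedGitLimit
  git config --show-origin --get-all core.packedGitWindowSize
  # File descriptor limit for the serving processes:
  ulimit -n
  # On a 64-bit build you could drop the limit and rely on the defaults:
  git config --unset-all core.packedGitLimit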

We used to see this quite a lot at GitHub, but hardly at all since
4c08018204 (pack-objects: protect against disappearing packs,
2011-10-14), which made sure that pack-objects does the same
is_pack_valid() check. The only time we see it now is with degenerate pack
cases (e.g., we accidentally hit a state with 20,000 packs or
something). It's possible there's been a regression recently, though.
Our version right now is based on v2.24.1.

> The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
> of issue, but is not yet integrated well with reachability bitmaps. You likely
> require the bitmaps to keep your server performance, so that may not be a way
> forward for you.

I wondered if the midx code might suffer from a race by missing out on
the is_pack_valid() check that fill_pack_entry() does. But it's right
there (including the copied comment!) in nth_midxed_pack_entry(). So
they should behave the same.

-Peff

* Re: Avoid race condition between fetch and repack/gc?
  2020-03-16 17:27   ` Jeff King
@ 2020-03-16 23:40     ` Bryan Turner
  2020-03-17 18:41       ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Bryan Turner @ 2020-03-16 23:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, Andreas Krey, Git Users

On Mon, Mar 16, 2020 at 10:27 AM Jeff King <peff@peff.net> wrote:
>
>   - you don't use config like core.packedGitLimit that encourages Git to
>     drop mmap'd windows. On 64-bit systems, this doesn't help at all (we
>     have plenty of address space, and since the maps are read-only, the
>     OS is free to drop the clean pages if there's memory pressure). On
>     32-bit systems, it's a necessity (and I'd expect a 32-bit server to
>     run into this issue a lot for large repositories).

I could be wrong, but I'm going to _guess_ from the :7999 in the
example output that the repository is hosted with Bitbucket Server.
Assuming my guess is correct, Bitbucket Server _does_ set
"core.packedgitlimit=256m" (and "core.packedgitwindowsize=32m", for
what it's worth). Those settings are applied specifically because
we've found they _do_ impact Git's overall memory usage when serving
clones in particular, which is important for cases where the system is
serving dozens (or hundreds) of concurrent clones.
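
(For reference, those settings amount to roughly the following as plain
git config; whether Bitbucket Server applies them per repository or per
invocation with -c is a detail I'm glossing over here:)

  git config core.packedGitLimit 256m
  git config core.packedGitWindowSize 32m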

Of course, we don't always do a great job of re-testing
once-beneficial settings later, which means sometimes they end up
being based on outdated observations. Perhaps we should prioritize
some additional testing here, especially on 64-bit systems. (We've
been setting "core.packedgitlimit" since back when Bitbucket Server
was called Stash and supported Git 1.7.6 on the server.)

That said, though, you note "core.packedgitlimit" is necessary on
32-bit servers and, unfortunately, we do still support Bitbucket
Server on 32-bit OSes. Maybe we should investigate applying (or not)
the flags depending on the platform. Sadly, that's not necessarily
simple to do since just because the _OS_ is 64-bit doesn't mean _Git_
is; it's pretty trivial to run 32-bit Git on 64-bit Windows, for
example.
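
As a rough way to check what a given installation actually is (a sketch;
the first line is Linux-specific and the --build-options output varies
between git versions):

  file "$(command -v git)"       # reports "ELF 32-bit ..." or "ELF 64-bit ..."
  git version --build-options    # prints cpu and sizeof information for the build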

Bryan

* Re: Avoid race condition between fetch and repack/gc?
  2020-03-16 23:40     ` Bryan Turner
@ 2020-03-17 18:41       ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2020-03-17 18:41 UTC (permalink / raw)
  To: Bryan Turner; +Cc: Derrick Stolee, Andreas Krey, Git Users

On Mon, Mar 16, 2020 at 04:40:13PM -0700, Bryan Turner wrote:

> On Mon, Mar 16, 2020 at 10:27 AM Jeff King <peff@peff.net> wrote:
> >
> >   - you don't use config like core.packedGitLimit that encourages Git to
> >     drop mmap'd windows. On 64-bit systems, this doesn't help at all (we
> >     have plenty of address space, and since the maps are read-only, the
> >     OS is free to drop the clean pages if there's memory pressure). On
> >     32-bit systems, it's a necessity (and I'd expect a 32-bit server to
> >     run into this issue a lot for large repositories).
> 
> I could be wrong, but I'm going to _guess_ from the :7999 in the
> example output that the repository is hosted with Bitbucket Server.
> Assuming my guess is correct, Bitbucket Server _does_ set
> "core.packedgitlimit=256m" (and "core.packedgitwindowsize=32m", for
> what it's worth).

Thanks for letting me know. That would definitely explain the behavior
Andreas is seeing.

> Those settings are applied specifically because
> we've found they _do_ impact Git's overall memory usage when serving
> clones in particular, which is important for cases where the system is
> serving dozens (or hundreds) of concurrent clones.

I do think there's an open question here, but it's also really easy to
be misled by common metrics. If we imagine a repo with a 512MB packfile
and two scenarios:

  - Git mmap's the whole packfile at once, and accesses pages within
    that mmap

  - Git mmap's no more than 256MB at once, shifting its window around as
    necessary

and then we get a bunch of clones. Then I'd expect to see:

  - virtual memory size for those Git processes will be higher. But much
    of that will be shared pages with each other. Measuring something
    like Proportional Set Size (PSS) yields a more useful number (a
    rough way to measure it is sketched after this list).

  - the Resident Set Size (RSS) of those processes will also be higher,
    because the OS may be leaving pages in the mmap resident. However,
    in my experience this is often a sign that there _isn't_ memory
    pressure on the system: if there were, infrequently used pages would
    be dropped by the OS (and they should be among the first to go, as
    they're by definition clean pages, though they do presumably compete
    with other read-only disk cache).
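
As a concrete (if rough) way to compare the two on Linux, something like
this sums the PSS of a single pack-objects process, assuming
/proc/<pid>/smaps is available; the process selection and units are just
for illustration:

  pid=$(pgrep -f 'git pack-objects' | head -n 1)
  awk '/^Pss:/ { kb += $2 } END { print kb, "kB PSS" }' "/proc/$pid/smaps"
  ps -o pid,rss,vsz,cmd -p "$pid"    # RSS/VSZ for comparison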

Or another way to think about it: the mmap patterns don't change the
working set patterns of Git. They just change what the OS knows, and
possibly how it reacts. And that's the "open question" for me: does the
operating system react significantly differently under memory pressure
for a big mmap with infrequently accessed pages than it would for a
series of smaller maps? I don't know the answer. Our servers are all
Linux, and we've tended to just trust that the operating system's
page-level decisions are sensible.

But I don't have any real numbers to support that. We stopped using
core.packedGitLimit in 2012 (because of this issue), and everything has
been good enough since to not bother looking into it more. Memory
pressure for us is usually from actual heap usage (e.g., pack-objects
has a high per-object cost plus delta window size times max object
size).

> Of course, we don't always do a great job of re-testing
> once-beneficial settings later, which means sometimes they end up
> being based on outdated observations. Perhaps we should prioritize
> some additional testing here, especially on 64-bit systems. (We've
> been setting "core.packedgitlimit" since back when Bitbucket Server
> was called Stash and supported Git 1.7.6 on the server.)

It sounds like you don't have any recent numbers, either. :)

> That said, though, you note "core.packedgitlimit" is necessary on
> 32-bit servers and, unfortunately, we do still support Bitbucket
> Server on 32-bit OSes. Maybe we should investigate applying (or not)
> the flags depending on the platform. Sadly, that's not necessarily
> simple to do since just because the _OS_ is 64-bit doesn't mean _Git_
> is; it's pretty trivial to run 32-bit Git on 64-bit Windows, for
> example.

The default for core.packedGitLimit is already 256MB on 32-bit builds of
Git (based on sizeof(void *), so truly on the build and not the OS). So
if you just left it unset, it would do what you want.

-Peff
