* Avoid race condition between fetch and repack/gc? @ 2020-03-16 8:23 Andreas Krey 2020-03-16 12:10 ` Derrick Stolee 0 siblings, 1 reply; 6+ messages in thread
From: Andreas Krey @ 2020-03-16 8:23 UTC (permalink / raw)
To: git

Hi all,

we are occasionally seeing things like this:

| DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune
| Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts.
| remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed
| error: git upload-pack: git-pack-objects died with error.
| fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
| remote: aborting due to possible repository corruption on the remote side.
| fatal: protocol error: bad pack header

and when you look in the server repository there is a new packfile dated just around
that time. It looks like the fetch tries to access a packfile that it assumes to exist,
but the GC on the server throws it away at just that moment, and thus upload-pack fails.

Is there a way to avoid this?

Should there be something like git repack waiting a bit before deleting old packfiles?

- Andreas

--
"Totally trivial. Famous last words."
From: Linus Torvalds <torvalds@*.org>
Date: Fri, 22 Jan 2010 07:29:21 -0800

^ permalink raw reply [flat|nested] 6+ messages in thread
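The window Andreas describes can be seen directly: `git repack -ad` (which gc runs for its repacking step) deletes superseded packfiles as soon as the replacement pack is written, with no grace period. A minimal throwaway-repo sketch of that deletion (paths and identities here are illustrative only):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git repack -adq
old_pack=$(ls .git/objects/pack/*.pack)   # the pack a long-running reader might hold
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m two
git repack -adq                           # writes a new pack and immediately...
test ! -e "$old_pack"                     # ...deletes the old one
```

A server process that resolved an object into `$old_pack` just before the second repack is exactly the reader that loses the race.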
* Re: Avoid race condition between fetch and repack/gc? 2020-03-16 8:23 Avoid race condition between fetch and repack/gc? Andreas Krey @ 2020-03-16 12:10 ` Derrick Stolee 2020-03-16 17:17 ` Nasser Grainawi 2020-03-16 17:27 ` Jeff King 0 siblings, 2 replies; 6+ messages in thread From: Derrick Stolee @ 2020-03-16 12:10 UTC (permalink / raw) To: Andreas Krey, git, peff@peff.net On 3/16/2020 4:23 AM, Andreas Krey wrote: > Hi all, > > we occasionally seeing things like this: > > | DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune I'm happy to see these options. I hope they are helping you! > | Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts. > | remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed This _could_ mean a lot of things, but.... > | error: git upload-pack: git-pack-objects died with error. > | fatal: git upload-pack: aborting due to possible repository corruption on the remote side. > | remote: aborting due to possible repository corruption on the remote side. > | fatal: protocol error: bad pack header > > and when you look in the server repository there is a new packfile dated just around > that time. It looks like the fetch tries to access a packfile that it assumes to exist, > but the GC on the server throws it away just in that moment, and thus upload-pack fails. ...your intuition about repacking seems accurate. The important part of the race condition is likely that the server process read and holds a read handle on the .idx file, but when looking for the object contents it tries to open the .pack file which was deleted. This error is emitted by use_pack() in packfile.c. I'm surprised there is no fallback here, and we simply die(). 
The race condition seems to be related to the loop in do_oid_object_info_extended()
in sha1-file.c that iterates through packs until it finds the object in question: it does
not verify that the .pack file is open with a valid handle before terminating the
loop and calling packed_object_info().

Perhaps the fix is to update do_oid_object_info_extended() to "accept" a pack as
the home of the object only after verifying the pack is either open or can be
opened. That seems like the least-invasive fix to me.

The more-invasive fix is to modify the stack from packed_object_info() to
use_pack() to use error messages and return codes instead of die(). This would
still need to affect the loop in do_oid_object_info_extended(), but may be a
better way to handle this situation in general.

Of course, this is a very critical code path, and maybe other community members
have more context as to why we are not already doing this?

> Is there a way to avoid this?
>
> Should there be, like git repack waiting a bit before deleting old packfiles?

This all depends on how you are managing your server. It is likely that you
could create your own maintenance that handles this for you.

The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
of issue, but is not yet integrated well with reachability bitmaps. You likely
require the bitmaps to keep your server performance, so that may not be a way
forward for you.

Thanks,
-Stolee

^ permalink raw reply [flat|nested] 6+ messages in thread
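The state Stolee describes (index readable, pack file gone) is easy to simulate from the outside. A sketch in shell that deletes only the .pack while leaving the .idx behind; a fresh process can then no longer read the object (the repo setup is illustrative):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git repack -adq                    # object now lives only in a pack
oid=$(git rev-parse HEAD)
git cat-file -e "$oid"             # readable while the pack exists
rm .git/objects/pack/*.pack        # simulate gc deleting the pack; .idx stays
! git cat-file -e "$oid" 2>/dev/null   # lookup now fails
```

This only shows the externally visible failure, not the in-process race itself; whether a given process dies or merely reports the object missing depends on where in the C code paths (discussed below in the thread) it first notices the pack is gone.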
* Re: Avoid race condition between fetch and repack/gc? 2020-03-16 12:10 ` Derrick Stolee @ 2020-03-16 17:17 ` Nasser Grainawi 2020-03-16 17:27 ` Jeff King 1 sibling, 0 replies; 6+ messages in thread From: Nasser Grainawi @ 2020-03-16 17:17 UTC (permalink / raw) To: Derrick Stolee; +Cc: Andreas Krey, git, peff@peff.net > On Mar 16, 2020, at 6:10 AM, Derrick Stolee <stolee@gmail.com> wrote: > > On 3/16/2020 4:23 AM, Andreas Krey wrote: >> Hi all, >> >> we occasionally seeing things like this: >> >> | DEBUG: 11:25:20: git -c advice.fetchShowForcedUpdates=false fetch --no-show-forced-updates -q --prune > > I'm happy to see these options. I hope they are helping you! > >> | Warning: Permanently added '[socgit.$company.com]:7999' (RSA) to the list of known hosts. >> | remote: fatal: packfile ./objects/pack/pack-20256f2be3bd51b57e519a9f2a4d3df09f231952.pack cannot be accessed > This _could_ mean a lot of things, but.... > >> | error: git upload-pack: git-pack-objects died with error. >> | fatal: git upload-pack: aborting due to possible repository corruption on the remote side. >> | remote: aborting due to possible repository corruption on the remote side. >> | fatal: protocol error: bad pack header >> >> and when you look in the server repository there is a new packfile dated just around >> that time. It looks like the fetch tries to access a packfile that it assumes to exist, >> but the GC on the server throws it away just in that moment, and thus upload-pack fails. > > ...your intuition about repacking seems accurate. The important part of the > race condition is likely that the server process read and holds a read handle > on the .idx file, but when looking for the object contents it tries to open > the .pack file which was deleted. > [snip] > >> Is there a way to avoid this? >> >> Should there be, like git repack waiting a bit before deleting old packfiles? > > This all depends on how you are managing your server. 
> It is likely that you could create your own maintenance that handles this for you.
>
> The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
> of issue, but is not yet integrated well with reachability bitmaps. You likely
> require the bitmaps to keep your server performance, so that may not be a way
> forward for you.

We manage this on our servers with a repack wrapper that first creates hard links
for all packfiles into an objects/pack/preserved dir, and then we have patches on
top of JGit [1] that actually know how to recover objects from that dir when the
original pack is removed by repacking. It's worked quite well for us for a couple
of years now and should be compatible with/without bitmaps (haven't specifically
tested) and any pack/repacking strategy.

[1] https://git.eclipse.org/r/122288

^ permalink raw reply [flat|nested] 6+ messages in thread
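The preserve step of such a wrapper can be sketched in plain shell. This is a guess at its shape based on Nasser's description, not their actual tooling (the recovery side lives in their JGit patches and is not shown); because hard links share the underlying bytes, the preserved copy survives the repack's deletion at no extra disk cost:

```shell
set -e
repo=$(mktemp -d)/repo
git init -q "$repo"
git -C "$repo" -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git -C "$repo" repack -adq

# preserve step: hard-link current pack files before repacking again
mkdir -p "$repo/.git/objects/pack/preserved"
for f in "$repo"/.git/objects/pack/pack-*.pack; do
    ln -f "$f" "$repo/.git/objects/pack/preserved/${f##*/}"
done

git -C "$repo" -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m two
git -C "$repo" repack -adq    # deletes the old pack from objects/pack/ ...
ls "$repo"/.git/objects/pack/preserved/   # ...but the preserved link keeps the bytes
```

Git does not scan the preserved/ subdirectory for packs, so the links are invisible to normal operation; something (like the JGit recovery patches, or a cron job pruning old entries) has to manage that directory.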
* Re: Avoid race condition between fetch and repack/gc? 2020-03-16 12:10 ` Derrick Stolee 2020-03-16 17:17 ` Nasser Grainawi @ 2020-03-16 17:27 ` Jeff King 2020-03-16 23:40 ` Bryan Turner 1 sibling, 1 reply; 6+ messages in thread From: Jeff King @ 2020-03-16 17:27 UTC (permalink / raw) To: Derrick Stolee; +Cc: Andreas Krey, git On Mon, Mar 16, 2020 at 08:10:56AM -0400, Derrick Stolee wrote: > > | error: git upload-pack: git-pack-objects died with error. > > | fatal: git upload-pack: aborting due to possible repository corruption on the remote side. > > | remote: aborting due to possible repository corruption on the remote side. > > | fatal: protocol error: bad pack header > > > > and when you look in the server repository there is a new packfile dated just around > > that time. It looks like the fetch tries to access a packfile that it assumes to exist, > > but the GC on the server throws it away just in that moment, and thus upload-pack fails. > > ...your intuition about repacking seems accurate. The important part of the > race condition is likely that the server process read and holds a read handle > on the .idx file, but when looking for the object contents it tries to open > the .pack file which was deleted. > > This error is emitted by use_pack() in packfile.c. I'm surprised there is no > fallback here, and we simply die(). Yes, you're correct this is what's going on, but there's some more subtlety (see below). The issue is that by the time we get to use_pack(), it's generally too late; we've committed to use this particular packed representation of the object, and it would be very hard to back out of it. There are really two cases worth considering independently: normal access of objects via read_object_file(), oid_object_info(), etc versus pack-objects. 
In the normal code paths, contrary to what you wrote here:

> The race condition seems to be related to the loop in do_oid_object_info_extended()
> in sha1-file.c looping through packs until finding the object in question: it does
> not verify that the .pack file is open with a valid handle before terminating the
> loop and calling packed_object_info().
>
> Perhaps the fix is to update do_oid_object_info_extended() to "accept" a pack as
> the home of the object only after verifying the pack is either open or can be
> opened. That seems like the least-invasive fix to me.

we do make sure we have an open handle to the pack. This happens in
fill_pack_entry(), which does:

	/*
	 * We are about to tell the caller where they can locate the
	 * requested object. We better make sure the packfile is
	 * still here and can be accessed before supplying that
	 * answer, as it may have been deleted since the index was
	 * loaded!
	 */
	if (!is_pack_valid(p))
		return 0;

And is_pack_valid() not only checks the pack, but leaves open either the
descriptor or our mmap'd handle.

So in do_oid_object_info_extended(), we call fill_pack_entry() and we won't
consider the object available unless we have a handle open to the pack. And
since nobody else is manipulating the handles in between (because any
multi-threaded locking is much coarser than this), it should be race-proof.

But pack-objects is not so lucky. Because it wants to look at intimate details
of the packed objects (e.g., reusing on-disk deltas or zlib-deflated contents),
it stores away the packfile/offset pair for objects during the "Counting" phase
and then uses it much later in the "Writing" phase. So it's possible for other
lookups in the meantime to cause us to drop our descriptor or mmap handle, and
then later we find we can't re-open it.
But in practice, this shouldn't happen as long as:

  - you have a reasonable number of packs compared to your file
    descriptor limit

  - you don't use config like core.packedGitLimit that encourages Git to
    drop mmap'd windows. On 64-bit systems, this doesn't help at all (we
    have plenty of address space, and since the maps are read-only, the
    OS is free to drop the clean pages if there's memory pressure). On
    32-bit systems, it's a necessity (and I'd expect a 32-bit server to
    run into this issue a lot for large repositories).

We used to see this quite a lot at GitHub, but hardly at all since 4c08018204
(pack-objects: protect against disappearing packs, 2011-10-14), which made sure
that pack-objects does the same is_pack_valid() check. The only times we see it
now are degenerate pack cases (e.g., we accidentally hit a state with 20,000
packs or something).

It's possible there's been a regression recently, though. Our version right now
is based on v2.24.1.

> The "git multi-pack-index (expire|repack)" cycle is built to prevent this sort
> of issue, but is not yet integrated well with reachability bitmaps. You likely
> require the bitmaps to keep your server performance, so that may not be a way
> forward for you.

I wondered if the midx code might suffer from a race by missing out on the
is_pack_valid() check that fill_pack_entry() does. But it's right there
(including the copied comment!) in nth_midxed_pack_entry(). So they should
behave the same.

-Peff

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Avoid race condition between fetch and repack/gc? 2020-03-16 17:27 ` Jeff King @ 2020-03-16 23:40 ` Bryan Turner 2020-03-17 18:41 ` Jeff King 0 siblings, 1 reply; 6+ messages in thread From: Bryan Turner @ 2020-03-16 23:40 UTC (permalink / raw) To: Jeff King; +Cc: Derrick Stolee, Andreas Krey, Git Users On Mon, Mar 16, 2020 at 10:27 AM Jeff King <peff@peff.net> wrote: > > - you don't use config like core.packedGitLimit that encourages Git to > drop mmap'd windows. On 64-bit systems, this doesn't help at all (we > have plenty of address space, and since the maps are read-only, the > OS is free to drop the clean pages if there's memory pressure). On > 32-bit systems, it's a necessity (and I'd expect a 32-bit server to > run into this issue a lot for large repositories). I could be wrong, but I'm going to _guess_ from the :7999 in the example output that the repository is hosted with Bitbucket Server. Assuming my guess is correct, Bitbucket Server _does_ set "core.packedgitlimit=256m" (and "core.packedgitwindowsize=32m", for what it's worth). Those settings are applied specifically because we've found they _do_ impact Git's overall memory usage when serving clones in particular, which is important for cases where the system is serving dozens (or hundreds) of concurrent clones. Of course, we don't always do a great job of re-testing once-beneficial settings later, which means sometimes they end up being based on outdated observations. Perhaps we should prioritize some additional testing here, especially on 64-bit systems. (We've been setting "core.packedgitlimit" since back when Bitbucket Server was called Stash and supported Git 1.7.6 on the server.) That said, though, you note "core.packedgitlimit" is necessary on 32-bit servers and, unfortunately, we do still support Bitbucket Server on 32-bit OSes. Maybe we should investigate applying (or not) the flags depending on the platform. 
Sadly, that's not necessarily simple to do since just because the _OS_ is 64-bit doesn't mean _Git_ is; it's pretty trivial to run 32-bit Git on 64-bit Windows, for example. Bryan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Avoid race condition between fetch and repack/gc? 2020-03-16 23:40 ` Bryan Turner @ 2020-03-17 18:41 ` Jeff King 0 siblings, 0 replies; 6+ messages in thread From: Jeff King @ 2020-03-17 18:41 UTC (permalink / raw) To: Bryan Turner; +Cc: Derrick Stolee, Andreas Krey, Git Users On Mon, Mar 16, 2020 at 04:40:13PM -0700, Bryan Turner wrote: > On Mon, Mar 16, 2020 at 10:27 AM Jeff King <peff@peff.net> wrote: > > > > - you don't use config like core.packedGitLimit that encourages Git to > > drop mmap'd windows. On 64-bit systems, this doesn't help at all (we > > have plenty of address space, and since the maps are read-only, the > > OS is free to drop the clean pages if there's memory pressure). On > > 32-bit systems, it's a necessity (and I'd expect a 32-bit server to > > run into this issue a lot for large repositories). > > I could be wrong, but I'm going to _guess_ from the :7999 in the > example output that the repository is hosted with Bitbucket Server. > Assuming my guess is correct, Bitbucket Server _does_ set > "core.packedgitlimit=256m" (and "core.packedgitwindowsize=32m", for > what it's worth). Thanks for letting me know. That would definitely explain the behavior Andreas is seeing. > Those settings are applied specifically because > we've found they _do_ impact Git's overall memory usage when serving > clones in particular, which is important for cases where the system is > serving dozens (or hundreds) of concurrent clones. I do think there's an open question here, but it's also really easy to be misled by common metrics. If we imagine a repo with a 512MB packfile and two scenarios: - Git mmap's the whole packfile at once, and accesses pages within that mmap - Git mmap's no more than 256MB at once, shifting its window around as necessary and then we get a bunch of clones. Then I'd expect to see: - virtual memory size for those Git processes will be higher. But much of that will be shared pages with each other. 
Measuring something like Proportional Set Size (PSS) yields a more useful number. - the Resident Set Size (RSS) of those processes will also be higher, because the OS may be leaving pages in the mmap resident. However, in my experience this is often a sign that there _isn't_ memory pressure on the system. Because if there was, then infrequently used pages would be dropped by the OS (and they should be among the first to go, as they're by definition clean pages. Though they do presumably compete with other read-only disk cache). Or another way to think about it: the mmap patterns don't change the working set patterns of Git. They just change what the OS knows, and possibly how it reacts. And that's the "open question" for me: does the operating system react significantly differently under memory pressure for a big mmap with infrequently accessed pages than it would for a series of smaller maps. I don't know the answer. Our servers are all Linux, and we've tended to just trust that the operating system's page-level decisions are sensible. But I don't have any real numbers to support that. We stopped using core.packedGitLimit in 2012 (because of this issue), and everything has been good enough since to not bother looking into it more. Memory pressure for us is usually from actual heap usage (e.g., pack-objects has a high per-object cost plus delta window size times max object size). > Of course, we don't always do a great job of re-testing > once-beneficial settings later, which means sometimes they end up > being based on outdated observations. Perhaps we should prioritize > some additional testing here, especially on 64-bit systems. (We've > been setting "core.packedgitlimit" since back when Bitbucket Server > was called Stash and supported Git 1.7.6 on the server.) It sounds like you don't have any recent numbers, either. 
:) > That said, though, you note "core.packedgitlimit" is necessary on > 32-bit servers and, unfortunately, we do still support Bitbucket > Server on 32-bit OSes. Maybe we should investigate applying (or not) > the flags depending on the platform. Sadly, that's not necessarily > simple to do since just because the _OS_ is 64-bit doesn't mean _Git_ > is; it's pretty trivial to run 32-bit Git on 64-bit Windows, for > example. The default for core.packedGitLimit is already 256MB on 32-bit builds of Git (based on sizeof(void *), so truly on the build and not the OS). So if you just left it unset, it would do what you want. -Peff ^ permalink raw reply [flat|nested] 6+ messages in thread
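The PSS-versus-RSS distinction Peff draws earlier in the thread can be observed directly on Linux via /proc. A sketch assuming a kernel with smaps_rollup (4.14 or later); PSS divides each shared page among the processes mapping it, so for a pack mmap'd by many concurrent upload-pack processes PSS sits well below RSS:

```shell
# Read the rollup accounting for the current shell process ($$).
# Values are in kB; Pss <= Rss always holds, and the gap measures sharing.
rss=$(awk '/^Rss:/ {print $2}' /proc/$$/smaps_rollup)
pss=$(awk '/^Pss:/ {print $2}' /proc/$$/smaps_rollup)
echo "Rss=${rss} kB  Pss=${pss} kB"
```

Sampling these for each git process serving clones, rather than plain RSS, gives a fairer picture of what core.packedGitLimit actually saves.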
end of thread, other threads:[~2020-03-17 18:41 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-03-16 8:23 Avoid race condition between fetch and repack/gc? Andreas Krey 2020-03-16 12:10 ` Derrick Stolee 2020-03-16 17:17 ` Nasser Grainawi 2020-03-16 17:27 ` Jeff King 2020-03-16 23:40 ` Bryan Turner 2020-03-17 18:41 ` Jeff King
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).