* Auto packing the repository - foreground or background in Windows?
From: Tao Klerks @ 2022-12-01 12:25 UTC
  To: git

Hi folks,

I came across a Windows user today whose "fetch" operations were
taking a long time, because their repository had passed some
persistent maintenance-triggering threshold *and* the resulting
auto-gc was running in the foreground (and not resolving the
maintenance-triggering condition automatically).

The user was seeing, at the end of their fetch, something like:

Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
Enumerating objects: 311322, done.
Nothing new to pack.
Checking connectivity: 1490123

Eventually, they noticed a subsequent recommendation to run "git
prune", after the connectivity check completed, and after they did the
git prune, they started getting "bad object" errors on fetch - so
there was clearly something else going wrong somewhere...

But my *question* is: Does anyone know where I could/should look to
understand why the GC was happening in the foreground, even though the
message says it will run in the background?

I don't know how to create the conditions for the auto-GC on demand
(how to create lots of loose objects??), so I don't know how to verify
whether it ever runs in the background on Windows, or what that might
depend on. I saw some discussions in 2016, but I can't tell what the
conclusion was; is it simply the case that git has been "lying" about
running GC in the background, on windows, for all these years? Or is
there something specific going on in this user's environment?

Any info welcome, thank you!
Tao

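A rough way to reproduce the trigger condition asked about above, offered
as a sketch rather than a definitive recipe: lower `gc.auto` in a throwaway
repository and write loose blobs with `git hash-object -w`. The temporary
directory, the threshold of 1000 and the count of 5000 objects are
illustrative assumptions. The git-config documentation describes `gc.auto`
as an approximate loose-object threshold (default 6700), so the sketch
stays well above the configured value.

```shell
#!/bin/sh
# Sketch: fill a scratch repository with loose objects so that
# "git gc --auto" considers itself needed. All paths and counts are
# illustrative assumptions, not values taken from the thread.
set -e

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Lower the (approximate) loose-object threshold; the documented default is 6700.
git config gc.auto 1000

# Create 5000 small files with distinct contents.
mkdir blobs
i=0
while [ "$i" -lt 5000 ]; do
	echo "blob $i" > "blobs/$i"
	i=$((i + 1))
done

# Store every file as a loose blob in the object database.
find blobs -type f | git hash-object -w --stdin-paths > /dev/null

# Show how many loose objects the repository now holds.
git count-objects
```

Running `git gc --auto` in that repository afterwards should print one of
the two "Auto packing" messages. Timing that run against
`git -c gc.autoDetach=false gc --auto`, which the documentation describes
as keeping the work in the foreground, is one way to check whether
anything was actually detached on a given platform.
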
* Re: Auto packing the repository - foreground or background in Windows?
From: Derrick Stolee @ 2022-12-06 18:03 UTC
  To: Tao Klerks, git

On 12/1/2022 7:25 AM, Tao Klerks wrote:

> But my *question* is: Does anyone know where I could/should look to
> understand why the GC was happening in the foreground, even though the
> message says it will run in the background?

On Windows, Git's foreground process cannot complete without also
killing the background process. I'm not sure on the concrete details,
but the lack of a background "git gc --auto" here is deliberate for
that platform.

> I don't know how to create the conditions for the auto-GC on demand
> (how to create lots of loose objects??), so I don't know how to verify
> whether it ever runs in the background on Windows, or what that might
> depend on. I saw some discussions in 2016, but I can't tell what the
> conclusion was; is it simply the case that git has been "lying" about
> running GC in the background, on windows, for all these years? Or is
> there something specific going on in this user's environment?

Instead, the modern recommendation for repositories where "git gc --auto"
would be slow is to run "git maintenance start" which will schedule
background maintenance jobs with the Windows scheduler. Those processes
are built to do updates that are non-invasive to concurrent foreground
processes. It also sets config to avoid "git gc --auto" commands at the
end of foreground Git processes.

See [1] for more details.

[1] https://git-scm.com/docs/git-maintenance

Thanks,
-Stolee

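The config side of this recommendation can be inspected without touching
the scheduler. The sketch below uses `git maintenance register`, which
according to the git-maintenance documentation performs the same config
updates as `start` but does not schedule any jobs. The scratch `HOME` and
the `demo` repository are assumptions made so the example leaves real
settings alone, and `git maintenance` is only present in reasonably recent
Git versions.

```shell
#!/bin/sh
# Sketch: observe the config changes that background maintenance relies on.
# The scratch HOME and the "demo" repository name are illustrative assumptions.
set -e

# Use a scratch HOME so the global config edits below stay isolated.
HOME=$(mktemp -d)
XDG_CONFIG_HOME="$HOME/.config"
export HOME XDG_CONFIG_HOME

repo="$HOME/demo"
git init -q "$repo"
cd "$repo"

# Same config updates as "git maintenance start", minus the scheduler.
git maintenance register

# Foreground auto-maintenance should now be switched off for this repository.
echo "maintenance.auto = $(git config --get maintenance.auto)"

# The repository should be listed in the user's global config.
git config --global --get-all maintenance.repo
```

On a real repository, `git maintenance start` would additionally register
jobs with the platform scheduler (the Windows scheduler in the case
discussed here); `git maintenance stop` and `git maintenance unregister`
undo those steps.
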
* Re: Auto packing the repository - foreground or background in Windows?
From: Ævar Arnfjörð Bjarmason @ 2022-12-06 19:19 UTC
  To: Derrick Stolee; +Cc: Tao Klerks, git

On Tue, Dec 06 2022, Derrick Stolee wrote:

> On 12/1/2022 7:25 AM, Tao Klerks wrote:
>> But my *question* is: Does anyone know where I could/should look to
>> understand why the GC was happening in the foreground, even though the
>> message says it will run in the background?
>
> On Windows, Git's foreground process cannot complete without also
> killing the background process. I'm not sure on the concrete details,
> but the lack of a background "git gc --auto" here is deliberate for
> that platform.
>
>> I don't know how to create the conditions for the auto-GC on demand
>> (how to create lots of loose objects??), so I don't know how to verify
>> whether it ever runs in the background on Windows, or what that might
>> depend on. I saw some discussions in 2016, but I can't tell what the
>> conclusion was; is it simply the case that git has been "lying" about
>> running GC in the background, on windows, for all these years? Or is
>> there something specific going on in this user's environment?
>
> Instead, the modern recommendation for repositories where "git gc --auto"
> would be slow is to run "git maintenance start" which will schedule
> background maintenance jobs with the Windows scheduler. Those processes
> are built to do updates that are non-invasive to concurrent foreground
> processes. It also sets config to avoid "git gc --auto" commands at the
> end of foreground Git processes.
>
> See [1] for more details.
>
> [1] https://git-scm.com/docs/git-maintenance

That's good advice, but Tao is pointing out that the message we emit
is buggy here, which is correct.

The problem is just that on Windows we always fail to daemonize(), but
didn't connect the bits that know that to the bits that emit the
message.

I think this should fix it:

diff --git a/builtin/gc.c b/builtin/gc.c
index 02455fdcd73..a5f599ebff0 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -623,9 +623,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		if (!need_to_gc())
 			return 0;
 		if (!quiet) {
+#ifndef NO_POSIX_GOODIES
 			if (detach_auto)
 				fprintf(stderr, _("Auto packing the repository in background for optimum performance.\n"));
 			else
+#endif
 				fprintf(stderr, _("Auto packing the repository for optimum performance.\n"));
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}

Tao: If you're interested do you mind carrying that (or some other
similar) patch forward? The above is:

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>

* Re: Auto packing the repository - foreground or background in Windows?
From: Jeff Hostetler @ 2022-12-06 22:41 UTC
  To: Derrick Stolee, Tao Klerks, git

On 12/6/22 1:03 PM, Derrick Stolee wrote:
> On 12/1/2022 7:25 AM, Tao Klerks wrote:
>> But my *question* is: Does anyone know where I could/should look to
>> understand why the GC was happening in the foreground, even though the
>> message says it will run in the background?
>
> On Windows, Git's foreground process cannot complete without also
> killing the background process. I'm not sure on the concrete details,
> but the lack of a background "git gc --auto" here is deliberate for
> that platform.

Here the GC code uses `daemonize()`. On Posix this is a wrapper around
`fork()` where the parent exits and the child continues the computation
(without stdin/out).

However, `daemonize()` just returns an ENOSYS on Windows, since Windows
doesn't have `fork()`. The net result is that the foreground process
falls thru and does the actual work.

[...]
> Instead, the modern recommendation for repositories where "git gc --auto"
> would be slow is to run "git maintenance start" which will schedule
> background maintenance jobs with the Windows scheduler. Those processes
> are built to do updates that are non-invasive to concurrent foreground
> processes. It also sets config to avoid "git gc --auto" commands at the
> end of foreground Git processes.
>
> See [1] for more details.
>
> [1] https://git-scm.com/docs/git-maintenance

It is possible to do the GC in the background on Windows (using other
techniques), but I don't think it is worth the bother now that we have
`git maintenance` to handle it.

And yes, AEvar's suggested fix for printing the correct error message
looks helpful.

Jeff

* Re: Auto packing the repository - foreground or background in Windows?
From: Tao Klerks @ 2022-12-08 14:29 UTC
  To: Jeff Hostetler; +Cc: Derrick Stolee, git

Thank you all. I will try to carry forward a patch with this change
and a hint about the recommended maintenance approach for longer runs.

On Tue, Dec 6, 2022 at 11:41 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
>
> On 12/6/22 1:03 PM, Derrick Stolee wrote:
> > On 12/1/2022 7:25 AM, Tao Klerks wrote:
> >> But my *question* is: Does anyone know where I could/should look to
> >> understand why the GC was happening in the foreground, even though the
> >> message says it will run in the background?
> >
> > On Windows, Git's foreground process cannot complete without also
> > killing the background process. I'm not sure on the concrete details,
> > but the lack of a background "git gc --auto" here is deliberate for
> > that platform.
>
> Here the GC code uses `daemonize()`. On Posix this is a wrapper around
> `fork()` where the parent exits and the child continues the computation
> (without stdin/out).
>
> However, `daemonize()` just returns an ENOSYS on Windows, since Windows
> doesn't have `fork()`. The net result is that the foreground process
> falls thru and does the actual work.
>
> [...]
> > Instead, the modern recommendation for repositories where "git gc --auto"
> > would be slow is to run "git maintenance start" which will schedule
> > background maintenance jobs with the Windows scheduler. Those processes
> > are built to do updates that are non-invasive to concurrent foreground
> > processes. It also sets config to avoid "git gc --auto" commands at the
> > end of foreground Git processes.
> >
> > See [1] for more details.
> >
> > [1] https://git-scm.com/docs/git-maintenance
>
> It is possible to do the GC in the background on Windows (using other
> techniques), but I don't think it is worth the bother now that we have
> `git maintenance` to handle it.
>
> And yes, AEvar's suggested fix for printing the correct error message
> looks helpful.
>
> Jeff

* Re: Auto packing the repository - foreground or background in Windows?
From: Tao Klerks @ 2022-12-08 14:52 UTC
  To: Derrick Stolee; +Cc: git

On Tue, Dec 6, 2022 at 7:03 PM Derrick Stolee <derrickstolee@github.com> wrote:
>
> Instead, the modern recommendation for repositories where "git gc --auto"
> would be slow is to run "git maintenance start" which will schedule
> background maintenance jobs with the Windows scheduler. Those processes
> are built to do updates that are non-invasive to concurrent foreground
> processes. It also sets config to avoid "git gc --auto" commands at the
> end of foreground Git processes.
>
> See [1] for more details.
>
> [1] https://git-scm.com/docs/git-maintenance
>

Thanks Stolee, I've known about the existence of this system for a
while, but I can't quite figure out what's recommended for who, when,
given the doc at https://git-scm.com/docs/git-maintenance

Clearly on Windows, one reason to do "git maintenance start" is to
avoid foregrounded "git gc --auto" runs later. That's a clear enough
benefit to say "frequent users of large repos on windows *should* run
'git maintenance start' (or have some setup process or GUI do it for
them) on those large repos".

Is there a corresponding tangible benefit on MacOS and/or Linux, over
simply letting "git gc --auto" do its backgrounded thing when it feels
like it? Or is there an eventual plan to *switch* from the current
"git gc --auto" spawning to a "git maintenance start" execution when
trigger conditions are met? Are there any *dis*advantages to running
"git maintenance start" in general or on any given platform?

For "my users", I have something like Scalar that can start the
maintenance on the repo where it's needed - but it seems like there
will be lots of users out there in the world who clone things like the
linux repo, which looks like it is big enough to warrant these kinds
of concerns, but it doesn't seem obvious that anyone will ever find
"https://git-scm.com/docs/git-maintenance" and decide to run "git
maintenance start" on their own...

As I noted in another email, I propose to replace "Auto packing the
repository for optimum performance" with something like "Auto packing
the repository for optimum performance; to run this kind of
maintenance in the background, see 'git maintenance' at
https://git-scm.com/docs/git-maintenance." - but I imagine I'm missing
a bigger picture / a long-term plan for how these two mechanisms
should interact.

My apologies if I've missed one or many conversations about this on
the list, but maybe a pointer here can also help me add directional
hints at https://git-scm.com/docs/git-maintenance for "outside users"?

* Re: Auto packing the repository - foreground or background in Windows?
From: Derrick Stolee @ 2022-12-09 15:11 UTC
  To: Tao Klerks; +Cc: git

On 12/8/2022 9:52 AM, Tao Klerks wrote:
> On Tue, Dec 6, 2022 at 7:03 PM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> Instead, the modern recommendation for repositories where "git gc --auto"
>> would be slow is to run "git maintenance start" which will schedule
>> background maintenance jobs with the Windows scheduler. Those processes
>> are built to do updates that are non-invasive to concurrent foreground
>> processes. It also sets config to avoid "git gc --auto" commands at the
>> end of foreground Git processes.
>>
>> See [1] for more details.
>>
>> [1] https://git-scm.com/docs/git-maintenance
>>
>
> Thanks Stolee, I've known about the existence of this system for a
> while, but I can't quite figure out what's recommended for who, when,
> given the doc at https://git-scm.com/docs/git-maintenance

Thanks for the feedback that this document could use a clearer
high-level description for recommended ways to use the command, and
_when_. One goal when creating the documentation was to _not_
recommend a specific use pattern, instead focusing on the many ways a
user could customize their maintenance patterns. Perhaps the feature
has stabilized enough (and shown its benefits) that we could add a
recommended use section.

> Clearly on Windows, one reason to do "git maintenance start" is to
> avoid foregrounded "git gc --auto" runs later. That's a clear enough
> benefit to say "frequent users of large repos on windows *should* run
> 'git maintenance start' (or have some setup process or GUI do it for
> them) on those large repos".
>
> Is there a corresponding tangible benefit on MacOS and/or Linux, over
> simply letting "git gc --auto" do its backgrounded thing when it feels
> like it? Or is there an eventual plan to *switch* from the current
> "git gc --auto" spawning to a "git maintenance start" execution when
> trigger conditions are met? Are there any *dis*advantages to running
> "git maintenance start" in general or on any given platform?

For large repositories, the default 'git gc --auto' takes a lot of
resources to rewrite all object data into a single pack-file. The
background maintenance does smaller, incremental repacks. Here,
"large" means "more than 2GB of packed object data", since that's the
default limit for the incremental repacks starting a new pack.

There are other benefits too: it does hourly prefetches, getting
object data from remotes before the user requests a ref update through
'git fetch' or 'git pull'. Those foreground operations speed up, as
well.

> For "my users", I have something like Scalar that can start the
> maintenance on the repo where it's needed - but it seems like there
> will be lots of users out there in the world who clone things like the
> linux repo, which looks like it is big enough to warrant these kinds
> of concerns, but it doesn't seem obvious that anyone will ever find
> "https://git-scm.com/docs/git-maintenance" and decide to run "git
> maintenance start" on their own...

We do what we can to advertise these kinds of features, but at some
point users need to self-discover things. But that's also a motivation
for the Scalar command: the user can relax some control to allow the
Scalar command to choose those recommended settings on behalf of the
user.

> As I noted in another email, I propose to replace "Auto packing the
> repository for optimum performance" with something like "Auto packing
> the repository for optimum performance; to run this kind of
> maintenance in the background, see 'git maintenance' at
> https://git-scm.com/docs/git-maintenance." - but I imagine I'm missing
> a bigger picture / a long-term plan for how these two mechanisms
> should interact.

A message that points out 'git maintenance' like this might work best
as part of the "advice" API, so those who don't want to see the
message every time could disable it.

> My apologies if I've missed one or many conversations about this on
> the list, but maybe a pointer here can also help me add directional
> hints at https://git-scm.com/docs/git-maintenance for "outside users"?

I'm trying to think of a builtin whose documentation has such strong
"recommended use" language. The best I can think of are commands with
substantial "examples" sections, such as 'git bundle'.

A more radical approach would be to create a new doc type that
provides recommendations for how to manage large repositories. I
imagine it would be sorted in order of increasing complexity,
something like:

 1. Use 'scalar' and see if it works for your needs.
 2. Self-serve with 'git maintenance start', 'git sparse-checkout',
    partial clone, and feature.manyFiles=true as needed.
 3. Go deep on individual plumbing commands and config options that
    provide knobs to tweak how Git manages information.

I think starting with some examples or a "recommended use" section for
'git maintenance' would be a better first step.

Thanks,
-Stolee