git@vger.kernel.org mailing list mirror (one of many)
* Auto packing the repository - foreground or background in Windows?
@ 2022-12-01 12:25 Tao Klerks
  2022-12-06 18:03 ` Derrick Stolee
  0 siblings, 1 reply; 7+ messages in thread
From: Tao Klerks @ 2022-12-01 12:25 UTC (permalink / raw)
  To: git

Hi folks,

I came across a Windows user today whose "fetch" operations were
taking a long time, because their repository had passed some
persistent maintenance-triggering threshold *and* the resulting
auto-gc was running in the foreground (and not resolving the
maintenance-triggering condition automatically).

The user was seeing, at the end of their fetch, something like:

Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
Enumerating objects: 311322, done.
Nothing new to pack.
Checking connectivity: 1490123

Eventually, they noticed a subsequent recommendation to run "git
prune", after the connectivity check completed, and after they did the
git prune, they started getting "bad object" errors on fetch - so
there was clearly something else going wrong somewhere...

But my *question* is: Does anyone know where I could/should look to
understand why the GC was happening in the foreground, even though the
message says it will run in the background?

I don't know how to create the conditions for the auto-GC on demand
(how to create lots of loose objects??), so I don't know how to verify
whether it ever runs in the background on Windows, or what that might
depend on. I saw some discussions in 2016, but I can't tell what the
conclusion was; is it simply the case that git has been "lying" about
running GC in the background, on windows, for all these years? Or is
there something specific going on in this user's environment?

Any info welcome, thank you!

Tao


* Re: Auto packing the repository - foreground or background in Windows?
  2022-12-01 12:25 Auto packing the repository - foreground or background in Windows? Tao Klerks
@ 2022-12-06 18:03 ` Derrick Stolee
  2022-12-06 19:19   ` Ævar Arnfjörð Bjarmason
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Derrick Stolee @ 2022-12-06 18:03 UTC (permalink / raw)
  To: Tao Klerks, git

On 12/1/2022 7:25 AM, Tao Klerks wrote: 
> But my *question* is: Does anyone know where I could/should look to
> understand why the GC was happening in the foreground, even though the
> message says it will run in the background?

On Windows, Git's foreground process cannot complete without also
killing the background process. I'm not sure on the concrete details,
but the lack of a background "git gc --auto" here is deliberate for
that platform.
 
> I don't know how to create the conditions for the auto-GC on demand
> (how to create lots of loose objects??), so I don't know how to verify
> whether it ever runs in the background on Windows, or what that might
> depend on. I saw some discussions in 2016, but I can't tell what the
> conclusion was; is it simply the case that git has been "lying" about
> running GC in the background, on windows, for all these years? Or is
> there something specific going on in this user's environment?

Instead, the modern recommendation for repositories where "git gc --auto"
would be slow is to run "git maintenance start" which will schedule
background maintenance jobs with the Windows scheduler. Those processes
are built to do updates that are non-invasive to concurrent foreground
processes. It also sets config to avoid "git gc --auto" commands at the
end of foreground Git processes.

See [1] for more details.

[1] https://git-scm.com/docs/git-maintenance

Thanks,
-Stolee


* Re: Auto packing the repository - foreground or background in Windows?
  2022-12-06 18:03 ` Derrick Stolee
@ 2022-12-06 19:19   ` Ævar Arnfjörð Bjarmason
  2022-12-06 22:41   ` Jeff Hostetler
  2022-12-08 14:52   ` Tao Klerks
  2 siblings, 0 replies; 7+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-12-06 19:19 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Tao Klerks, git


On Tue, Dec 06 2022, Derrick Stolee wrote:

> On 12/1/2022 7:25 AM, Tao Klerks wrote: 
>> But my *question* is: Does anyone know where I could/should look to
>> understand why the GC was happening in the foreground, even though the
>> message says it will run in the background?
>
> On Windows, Git's foreground process cannot complete without also
> killing the background process. I'm not sure on the concrete details,
> but the lack of a background "git gc --auto" here is deliberate for
> that platform.
>  
>> I don't know how to create the conditions for the auto-GC on demand
>> (how to create lots of loose objects??), so I don't know how to verify
>> whether it ever runs in the background on Windows, or what that might
>> depend on. I saw some discussions in 2016, but I can't tell what the
>> conclusion was; is it simply the case that git has been "lying" about
>> running GC in the background, on windows, for all these years? Or is
>> there something specific going on in this user's environment?
>
> Instead, the modern recommendation for repositories where "git gc --auto"
> would be slow is to run "git maintenance start" which will schedule
> background maintenance jobs with the Windows scheduler. Those processes
> are built to do updates that are non-invasive to concurrent foreground
> processes. It also sets config to avoid "git gc --auto" commands at the
> end of foreground Git processes.
>
> See [1] for more details.
>
> [1] https://git-scm.com/docs/git-maintenance

That's good advice, but Tao is pointing out that the message we emit is
buggy here, which is correct.

The problem is just that on Windows we always fail to daemonize(), but
we didn't connect the bits that know that to the bits that emit the
message.

I think this should fix it:
	
	diff --git a/builtin/gc.c b/builtin/gc.c
	index 02455fdcd73..a5f599ebff0 100644
	--- a/builtin/gc.c
	+++ b/builtin/gc.c
	@@ -623,9 +623,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
	 		if (!need_to_gc())
	 			return 0;
	 		if (!quiet) {
	+#ifndef NO_POSIX_GOODIES
	 			if (detach_auto)
	 				fprintf(stderr, _("Auto packing the repository in background for optimum performance.\n"));
	 			else
	+#endif
	 				fprintf(stderr, _("Auto packing the repository for optimum performance.\n"));
	 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
	 		}

Tao: If you're interested, do you mind carrying that (or some other
similar) patch forward?

The above is: Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
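
For illustration only, here is the same decision pulled out into a
standalone sketch. It is not the code in builtin/gc.c: the helper names
notice_says_background() and print_auto_gc_notice() are hypothetical,
while NO_POSIX_GOODIES and detach_auto are taken from the diff above.
The idea is that a build without fork() never claims background
operation, whatever detach_auto says.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the notice may claim that gc runs
 * in the background, 0 otherwise.
 */
static int notice_says_background(int detach_auto)
{
#ifdef NO_POSIX_GOODIES
	/* No fork() on this build, so gc always runs in the foreground. */
	(void)detach_auto;
	return 0;
#else
	return detach_auto != 0;
#endif
}

/* Hypothetical helper: prints the notice that matches the decision. */
static void print_auto_gc_notice(FILE *out, int detach_auto)
{
	if (notice_says_background(detach_auto))
		fprintf(out, "Auto packing the repository in background for optimum performance.\n");
	else
		fprintf(out, "Auto packing the repository for optimum performance.\n");
	fprintf(out, "See \"git help gc\" for manual housekeeping.\n");
}
```

Calling print_auto_gc_notice(stderr, 1) from a build with
NO_POSIX_GOODIES defined would print the message without "in
background", which is the behaviour the patch is after.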


* Re: Auto packing the repository - foreground or background in Windows?
  2022-12-06 18:03 ` Derrick Stolee
  2022-12-06 19:19   ` Ævar Arnfjörð Bjarmason
@ 2022-12-06 22:41   ` Jeff Hostetler
  2022-12-08 14:29     ` Tao Klerks
  2022-12-08 14:52   ` Tao Klerks
  2 siblings, 1 reply; 7+ messages in thread
From: Jeff Hostetler @ 2022-12-06 22:41 UTC (permalink / raw)
  To: Derrick Stolee, Tao Klerks, git



On 12/6/22 1:03 PM, Derrick Stolee wrote:
> On 12/1/2022 7:25 AM, Tao Klerks wrote:
>> But my *question* is: Does anyone know where I could/should look to
>> understand why the GC was happening in the foreground, even though the
>> message says it will run in the background?
> 
> On Windows, Git's foreground process cannot complete without also
> killing the background process. I'm not sure on the concrete details,
> but the lack of a background "git gc --auto" here is deliberate for
> that platform.

Here the GC code uses `daemonize()`.  On Posix this is a wrapper around
`fork()` where the parent exits and the child continues the computation
(without stdin/out).

However, `daemonize()` just returns an ENOSYS on Windows, since Windows
doesn't have `fork()`.  The net result is that the foreground process
falls thru and does the actual work.


[...]
> Instead, the modern recommendation for repositories where "git gc --auto"
> would be slow is to run "git maintenance start" which will schedule
> background maintenance jobs with the Windows scheduler. Those processes
> are built to do updates that are non-invasive to concurrent foreground
> processes. It also sets config to avoid "git gc --auto" commands at the
> end of foreground Git processes.
> 
> See [1] for more details.
> 
> [1] https://git-scm.com/docs/git-maintenance

It is possible to do the GC in the background on Windows (using other
techniques), but I don't think it is worth the bother now that we have
`git maintenance` to handle it.

And yes, AEvar's suggested fix for printing the correct error message
looks helpful.


Jeff


* Re: Auto packing the repository - foreground or background in Windows?
  2022-12-06 22:41   ` Jeff Hostetler
@ 2022-12-08 14:29     ` Tao Klerks
  0 siblings, 0 replies; 7+ messages in thread
From: Tao Klerks @ 2022-12-08 14:29 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Derrick Stolee, git

Thank you all. I will try to carry forward a patch with this change
and a hint about the recommended maintenance approach for longer runs.

On Tue, Dec 6, 2022 at 11:41 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
>
>
>
> On 12/6/22 1:03 PM, Derrick Stolee wrote:
> > On 12/1/2022 7:25 AM, Tao Klerks wrote:
> >> But my *question* is: Does anyone know where I could/should look to
> >> understand why the GC was happening in the foreground, even though the
> >> message says it will run in the background?
> >
> > On Windows, Git's foreground process cannot complete without also
> > killing the background process. I'm not sure on the concrete details,
> > but the lack of a background "git gc --auto" here is deliberate for
> > that platform.
>
> Here the GC code uses `daemonize()`.  On Posix this is a wrapper around
> `fork()` where the parent exits and the child continues the computation
> (without stdin/out).
>
> However, `daemonize()` just returns an ENOSYS on Windows, since Windows
> doesn't have `fork()`.  The net result is that the foreground process
> falls thru and does the actual work.
>
>
> [...]
> > Instead, the modern recommendation for repositories where "git gc --auto"
> > would be slow is to run "git maintenance start" which will schedule
> > background maintenance jobs with the Windows scheduler. Those processes
> > are built to do updates that are non-invasive to concurrent foreground
> > processes. It also sets config to avoid "git gc --auto" commands at the
> > end of foreground Git processes.
> >
> > See [1] for more details.
> >
> > [1] https://git-scm.com/docs/git-maintenance
>
> It is possible to do the GC in the background on Windows (using other
> techniques), but I don't think it is worth the bother now that we have
> `git maintenance` to handle it.
>
> And yes, AEvar's suggested fix for printing the correct error message
> looks helpful.
>
>
> Jeff


* Re: Auto packing the repository - foreground or background in Windows?
  2022-12-06 18:03 ` Derrick Stolee
  2022-12-06 19:19   ` Ævar Arnfjörð Bjarmason
  2022-12-06 22:41   ` Jeff Hostetler
@ 2022-12-08 14:52   ` Tao Klerks
  2022-12-09 15:11     ` Derrick Stolee
  2 siblings, 1 reply; 7+ messages in thread
From: Tao Klerks @ 2022-12-08 14:52 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git

On Tue, Dec 6, 2022 at 7:03 PM Derrick Stolee <derrickstolee@github.com> wrote:
>
> Instead, the modern recommendation for repositories where "git gc --auto"
> would be slow is to run "git maintenance start" which will schedule
> background maintenance jobs with the Windows scheduler. Those processes
> are built to do updates that are non-invasive to concurrent foreground
> processes. It also sets config to avoid "git gc --auto" commands at the
> end of foreground Git processes.
>
> See [1] for more details.
>
> [1] https://git-scm.com/docs/git-maintenance
>

Thanks Stolee, I've known about the existence of this system for a
while, but I can't quite figure out what's recommended for who, when,
given the doc at https://git-scm.com/docs/git-maintenance

Clearly on Windows, one reason to do "git maintenance start" is to
avoid foregrounded "git gc --auto" runs later. That's a clear enough
benefit to say "frequent users of large repos on windows *should* run
'git maintenance start' (or have some setup process or GUI do it for
them) on those large repos".

Is there a corresponding tangible benefit on macOS and/or Linux, over
simply letting "git gc --auto" do its backgrounded thing when it feels
like it? Or is there an eventual plan to *switch* from the current
"git gc --auto" spawning to a "git maintenance start" execution when
trigger conditions are met? Are there any *dis*advantages to running
"git maintenance start" in general or on any given platform?

For "my users", I have something like Scalar that can start the
maintenance on the repo where it's needed - but it seems like there
will be lots of users out there in the world who clone things like the
linux repo, which looks like it is big enough to warrant these kinds
of concerns, but it doesn't seem obvious that anyone will ever find
"https://git-scm.com/docs/git-maintenance" and decide to run "git
maintenance start" on their own...

As I noted in another email, I propose to replace "Auto packing the
repository for optimum performance" with something like "Auto packing
the repository for optimum performance; to run this kind of
maintenance in the background, see 'git maintenance' at
https://git-scm.com/docs/git-maintenance." - but I imagine I'm missing
a bigger picture / a long-term plan for how these two mechanisms
should interact.

My apologies if I've missed one or many conversations about this on
the list, but maybe a pointer here can also help me add directional
hints at https://git-scm.com/docs/git-maintenance for "outside users"?


* Re: Auto packing the repository - foreground or background in Windows?
  2022-12-08 14:52   ` Tao Klerks
@ 2022-12-09 15:11     ` Derrick Stolee
  0 siblings, 0 replies; 7+ messages in thread
From: Derrick Stolee @ 2022-12-09 15:11 UTC (permalink / raw)
  To: Tao Klerks; +Cc: git

On 12/8/2022 9:52 AM, Tao Klerks wrote:
> On Tue, Dec 6, 2022 at 7:03 PM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> Instead, the modern recommendation for repositories where "git gc --auto"
>> would be slow is to run "git maintenance start" which will schedule
>> background maintenance jobs with the Windows scheduler. Those processes
>> are built to do updates that are non-invasive to concurrent foreground
>> processes. It also sets config to avoid "git gc --auto" commands at the
>> end of foreground Git processes.
>>
>> See [1] for more details.
>>
>> [1] https://git-scm.com/docs/git-maintenance
>>
> 
> Thanks Stolee, I've known about the existence of this system for a
> while, but I can't quite figure out what's recommended for who, when,
> given the doc at https://git-scm.com/docs/git-maintenance

Thanks for the feedback that this document could use a clearer
high-level description for recommended ways to use the command, and
_when_.

One goal when creating the documentation was to _not_ recommend a
specific use pattern, instead focusing on the many ways a user could
customize their maintenance patterns. Perhaps the feature has
stabilized enough (and shown its benefits) that we could add a
recommended use section.
 
> Clearly on Windows, one reason to do "git maintenance start" is to
> avoid foregrounded "git gc --auto" runs later. That's a clear enough
> benefit to say "frequent users of large repos on windows *should* run
> 'git maintenance start' (or have some setup process or GUI do it for
> them) on those large repos".
> 
> Is there a corresponding tangible benefit on MacOS and/or Linux, over
> simply getting "git gc --auto" do its backgrounded thing when it feels
> like it? Or is there an eventual plan to *switch* from the current
> "git gc --auto" spawning to a "git maintenance start" execution when
> trigger conditions are met? Are there any *dis*advantages to running
> "git maintenance start" in general or on any given platform?

For large repositories, the default 'git gc --auto' takes a lot of
resources to rewrite all object data into a single pack-file. The
background maintenance does smaller, incremental repacks. Here,
"large" means "more than 2GB of packed object data", since that's
the default limit for the incremental repacks starting a new pack.

There are other benefits too: it does hourly prefetches, getting
object data from remotes before the user requests a ref update
through 'git fetch' or 'git pull'. Those foreground operations
speed up as well.

> For "my users", I have something like Scalar that can start the
> maintenance on the repo where it's needed - but it seems like there
> will be lots of users out there in the world who clone things like the
> linux repo, which looks like it is big enough to warrant these kinds
> of concerns, but it doesn't seem obvious that anyone will ever find
> "https://git-scm.com/docs/git-maintenance" and decide to run "git
> maintenance start" on their own...

We do what we can to advertise these kinds of features, but at some
point users need to self-discover things. But that's also a motivation
for the Scalar command: the user can relax some control to allow the
Scalar command to choose those recommended settings on behalf of the
user.

> As I noted in another email, I propose to replace "Auto packing the
> repository for optimum performance" with something like "Auto packing
> the repository for optimum performance; to run this kind of
> maintenance in the background, see 'git maintenance' at
> https://git-scm.com/docs/git-maintenance." - but I imagine I'm missing
> a bigger picture / a long-term plan for how these two mechanisms
> should interact.

A message that points out 'git maintenance' like this might work best
as part of the "advice" API, so those who don't want to see the
message every time could disable it.
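
As a rough sketch of that idea, assuming a standalone mock rather than
Git's real advice API: struct advice_toggle, advise_if_toggle_enabled()
and the key "autoGcForeground" are all hypothetical, and the wording of
the "Disable this message" line is likewise an assumption made for
illustration.

```c
#include <stdio.h>

/* Hypothetical model of one advice.* setting; not Git's data structure. */
struct advice_toggle {
	const char *key;	/* e.g. "autoGcForeground" (made-up key) */
	int enabled;		/* 0 once the user has turned the hint off */
};

/*
 * Prints the hint only while the toggle is enabled, and tells the user
 * how to silence it. Returns 1 if something was printed, 0 otherwise.
 */
static int advise_if_toggle_enabled(FILE *out,
				    const struct advice_toggle *toggle,
				    const char *message)
{
	if (!toggle->enabled)
		return 0;
	fprintf(out, "hint: %s\n", message);
	fprintf(out, "hint: Disable this message with \"git config advice.%s false\"\n",
		toggle->key);
	return 1;
}
```

The point of routing the hint through a toggle is that the pointer to
'git maintenance' stays visible for new users but stops appearing once
someone has opted out.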

> My apologies if I've missed one or many conversations about this on
> the list, but maybe a pointer here can also help me add directional
> hints at https://git-scm.com/docs/git-maintenance for "outside users"?

I'm trying to think of a builtin whose documentation has such strong
"recommended use" language.

The best I can think of are commands with substantial "examples"
sections, such as 'git bundle'.

A more radical approach would be to create a new doc type that
provides recommendations for how to manage large repositories. I
imagine it would be sorted in order of increasing complexity,
something like:

 1. Use 'scalar' and see if it works for your needs.

 2. Self-serve with 'git maintenance start', 'git sparse-checkout',
    partial clone, and feature.manyFiles=true as needed.

 3. Go deep on individual plumbing commands and config options
    that provide knobs to tweak how Git manages information.

I think starting with some examples or a "recommended use" section
for 'git maintenance' would be a better first step.

Thanks,
-Stolee
