git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
@ 2020-07-07 14:21 Derrick Stolee via GitGitGadget
  2020-07-08 23:57 ` Emily Shaffer
  0 siblings, 1 reply; 14+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2020-07-07 14:21 UTC
  To: git
  Cc: Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

This is a second attempt at redesigning Git's repository maintenance
patterns. The first attempt [1] included a way to run jobs in the background
using a long-lived process; that idea was rejected and is not included in
this series. A future series will use the OS to handle scheduling tasks.

[1] 
https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/

As mentioned before, git gc already plays the role of maintaining Git
repositories. It has accumulated several smaller pieces in its long history,
including:

 1. Repacking all reachable objects into one pack-file (and deleting
    unreachable objects).
 2. Packing refs.
 3. Expiring reflogs.
 4. Clearing rerere logs.
 5. Updating the commit-graph file.

While expiring reflogs, clearing rerere logs, and deleting unreachable
objects are suitable under the guise of "garbage collection", packing refs
and updating the commit-graph file are not as obviously fitting. Further,
these operations are "all or nothing" in that they rewrite almost all
repository data, which does not perform well at extremely large scales.
These operations can also be disruptive to foreground Git commands when git
gc --auto triggers during routine use.

This series does not intend to change what git gc does, but instead create
new choices for automatic maintenance activities, of which git gc remains
the only one enabled by default.

The new maintenance tasks are:

 * 'commit-graph' : write and verify a single layer of an incremental
   commit-graph.
 * 'loose-objects' : prune packed loose objects, then create a new pack from
   a batch of loose objects.
 * 'pack-files' : expire redundant packs from the multi-pack-index, then
   repack using the multi-pack-index's incremental repack strategy.
 * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.

These tasks are all disabled by default, but can be enabled with config
options or run explicitly using "git maintenance run --task=". There are
additional config options to allow customizing the conditions for which the
tasks run during the '--auto' option. ('fetch' will never run with the
'--auto' option.)
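
As a rough illustration of the paragraph above (a sketch, not taken from
the patches): the config key follows the 'maintenance.<task>.enabled'
pattern named in patch 16, and the task names are the ones listed in this
cover letter; released versions of Git may use different names. The 'run'
helper, the DRY_RUN variable, and the 'enable_and_run' function are
assumptions added here so the commands can be previewed without touching
a repository.

```shell
# Print each command; execute it unless DRY_RUN=1 (illustrative helper).
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Enable the given tasks in one repository, then run each once explicitly.
# Usage: enable_and_run <repo> <task>...
enable_and_run() {
    repo=$1
    shift
    for task in "$@"; do
        run git -C "$repo" config "maintenance.$task.enabled" true
    done
    for task in "$@"; do
        run git -C "$repo" maintenance run --task="$task"
    done
}
```

Calling 'DRY_RUN=1; enable_and_run /<path-to-repo> commit-graph' only
prints the two git commands; without DRY_RUN they are also executed.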

Because 'gc' is implemented as a maintenance task, the most dramatic change
of this series is to convert the 'git gc --auto' calls into 'git maintenance
run --auto' calls at the end of some Git commands. By default, the only
change is that 'git gc --auto' will be run below an additional 'git
maintenance' process.

The 'git maintenance' builtin has a 'run' subcommand so it can be extended
later with subcommands that manage background maintenance, such as 'start',
'stop', 'pause', or 'schedule'. These are not the subject of this series, as
it is important to focus on the maintenance activities themselves.

An expert user could set up scheduled background maintenance themselves with
the current series. I have the following crontab data set up to run
maintenance on an hourly basis:

0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

My config includes all tasks except the 'gc' task. The hourly run is
over-aggressive, but is sufficient for testing. I'll replace it with daily
when I feel satisfied.
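
A small sketch of how the hourly entry above could be generated and later
switched to a daily one. The function name 'cron_entry' and the 03:00
time for the daily run are illustrative assumptions; the command itself is
the one from the crontab line above. Paths containing spaces or '%' would
need extra quoting in a real crontab.

```shell
# Print a crontab entry that runs maintenance for one repository.
# Usage: cron_entry <hourly|daily> <path-to-repo>
cron_entry() {
    case $1 in
        hourly) schedule='0 * * * *' ;;   # minute 0 of every hour
        daily)  schedule='0 3 * * *' ;;   # 03:00 every day (arbitrary choice)
        *) echo "unknown frequency: $1" >&2; return 1 ;;
    esac
    printf '%s git -C %s maintenance run --no-quiet >>%s/.git/maintenance.log\n' \
        "$schedule" "$2" "$2"
}
```

The printed line can be pasted into 'crontab -e'; one line is needed per
repository.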

Hopefully this direction is seen as a positive one. My goal was to add more
options for expert users, along with the flexibility to create background
maintenance via the OS in a later series.

OUTLINE
=======

Patches 1-4 remove some references to the_repository in builtin/gc.c before
we start depending on code in that builtin.

Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
simple shim over 'git gc' and replace calls to 'git gc --auto' from other
commands.

Patches 8-15 create new maintenance tasks. These are the same tasks sent in
the previous RFC.

Patches 16-21 create more customization through config and perform other
polish items.

FUTURE WORK
===========

 * Add 'start', 'stop', and 'schedule' subcommands to initialize the
   commands run in the background.

 * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
   default, but might have different '--auto' conditions and more config
   options.

 * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
   with use of the 'commit-graph' task.

Thanks,
-Stolee

Derrick Stolee (21):
  gc: use the_repository less often
  gc: use repository in too_many_loose_objects()
  gc: use repo config
  gc: drop the_repository in log location
  maintenance: create basic maintenance runner
  maintenance: add --quiet option
  maintenance: replace run_auto_gc()
  maintenance: initialize task array and hashmap
  maintenance: add commit-graph task
  maintenance: add --task option
  maintenance: take a lock on the objects directory
  maintenance: add fetch task
  maintenance: add loose-objects task
  maintenance: add pack-files task
  maintenance: auto-size pack-files batch
  maintenance: create maintenance.<task>.enabled config
  maintenance: use pointers to check --auto
  maintenance: add auto condition for commit-graph task
  maintenance: create auto condition for loose-objects
  maintenance: add pack-files auto condition
  midx: use start_delayed_progress()

 .gitignore                           |   1 +
 Documentation/config.txt             |   2 +
 Documentation/config/maintenance.txt |  32 +
 Documentation/fetch-options.txt      |   5 +-
 Documentation/git-clone.txt          |   7 +-
 Documentation/git-maintenance.txt    | 124 ++++
 builtin.h                            |   1 +
 builtin/am.c                         |   2 +-
 builtin/commit.c                     |   2 +-
 builtin/fetch.c                      |   6 +-
 builtin/gc.c                         | 881 +++++++++++++++++++++++++--
 builtin/merge.c                      |   2 +-
 builtin/rebase.c                     |   4 +-
 commit-graph.c                       |   8 +-
 commit-graph.h                       |   1 +
 config.c                             |  24 +-
 config.h                             |   2 +
 git.c                                |   1 +
 midx.c                               |  12 +-
 midx.h                               |   1 +
 object.h                             |   1 +
 run-command.c                        |   7 +-
 run-command.h                        |   2 +-
 t/t5319-multi-pack-index.sh          |  14 +-
 t/t5510-fetch.sh                     |   2 +-
 t/t5514-fetch-multiple.sh            |   2 +-
 t/t7900-maintenance.sh               | 211 +++++++
 27 files changed, 1265 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/config/maintenance.txt
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh


base-commit: 4a0fcf9f760c9774be77f51e1e88a7499b53d2e2
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/671
-- 
gitgitgadget

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-07 14:21 Derrick Stolee via GitGitGadget
@ 2020-07-08 23:57 ` Emily Shaffer
  2020-07-09 11:21   ` Derrick Stolee
  0 siblings, 1 reply; 14+ messages in thread
From: Emily Shaffer @ 2020-07-08 23:57 UTC
  To: Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
> This is a second attempt at redesigning Git's repository maintenance
> patterns. The first attempt [1] included a way to run jobs in the background
> using a long-lived process; that idea was rejected and is not included in
> this series. A future series will use the OS to handle scheduling tasks.
> 
> [1] 
> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/
> 
> As mentioned before, git gc already plays the role of maintaining Git
> repositories. It has accumulated several smaller pieces in its long history,
> including:
> 
>  1. Repacking all reachable objects into one pack-file (and deleting
>     unreachable objects).
>  2. Packing refs.
>  3. Expiring reflogs.
>  4. Clearing rerere logs.
>  5. Updating the commit-graph file.
> 
> While expiring reflogs, clearing rerere logs, and deleting unreachable
> objects are suitable under the guise of "garbage collection", packing refs
> and updating the commit-graph file are not as obviously fitting. Further,
> these operations are "all or nothing" in that they rewrite almost all
> repository data, which does not perform well at extremely large scales.
> These operations can also be disruptive to foreground Git commands when git
> gc --auto triggers during routine use.
> 
> This series does not intend to change what git gc does, but instead create
> new choices for automatic maintenance activities, of which git gc remains
> the only one enabled by default.
> 
> The new maintenance tasks are:
> 
>  * 'commit-graph' : write and verify a single layer of an incremental
>    commit-graph.
>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>    a batch of loose objects.
>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>    repack using the multi-pack-index's incremental repack strategy.
>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.
> 
> These tasks are all disabled by default, but can be enabled with config
> options or run explicitly using "git maintenance run --task=". There are
> additional config options to allow customizing the conditions for which the
> tasks run during the '--auto' option. ('fetch' will never run with the
> '--auto' option.)
> 
>  Because 'gc' is implemented as a maintenance task, the most dramatic change
> of this series is to convert the 'git gc --auto' calls into 'git maintenance
> run --auto' calls at the end of some Git commands. By default, the only
> change is that 'git gc --auto' will be run below an additional 'git
> maintenance' process.
> 
> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
> later with subcommands that manage background maintenance, such as 'start',
> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
> it is important to focus on the maintenance activities themselves.
> 
> An expert user could set up scheduled background maintenance themselves with
> the current series. I have the following crontab data set up to run
> maintenance on an hourly basis:
> 
> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

One thing I wonder about - now I have to go and make a new crontab
(which is easy) or Task Scheduler task (which is a pain) for every repo,
right?

Is it infeasible to ask for 'git maintenance' to learn something like
'--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
config like "maintenance.targetRepo = /<path-to-repo>"?

> 
> My config includes all tasks except the 'gc' task. The hourly run is
> over-aggressive, but is sufficient for testing. I'll replace it with daily
> when I feel satisfied.
> 
> Hopefully this direction is seen as a positive one. My goal was to add more
> options for expert users, along with the flexibility to create background
> maintenance via the OS in a later series.
> 
> OUTLINE
> =======
> 
> Patches 1-4 remove some references to the_repository in builtin/gc.c before
> we start depending on code in that builtin.
> 
> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
> simple shim over 'git gc' and replace calls to 'git gc --auto' from other
> commands.

For me, I'd prefer to see 'git maintenance run' get bigger and 'git gc
--auto' get smaller or disappear. Is there a plan towards that
direction, or is that out of scope for 'git maintenance run'? Similar
examples I can think of include 'git annotate' and 'git pickaxe'.

> 
> Patches 8-15 create new maintenance tasks. These are the same tasks sent in
> the previous RFC.
> 
> Patches 16-21 create more customization through config and perform other
> polish items.
> 
> FUTURE WORK
> ===========
> 
>  * Add 'start', 'stop', and 'schedule' subcommands to initialize the
>    commands run in the background.
>    
>    
>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>    default, but might have different '--auto' conditions and more config
>    options.

Like I mentioned above, for me, I'd rather just see the 'gc' builtin go
away :)

>    
>    
>  * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
>    with use of the 'commit-graph' task.
>    
>    
> 
> Thanks, -Stolee
> 
> Derrick Stolee (21):
>   gc: use the_repository less often
>   gc: use repository in too_many_loose_objects()
>   gc: use repo config
>   gc: drop the_repository in log location
>   maintenance: create basic maintenance runner
>   maintenance: add --quiet option
>   maintenance: replace run_auto_gc()
>   maintenance: initialize task array and hashmap
>   maintenance: add commit-graph task
>   maintenance: add --task option
>   maintenance: take a lock on the objects directory
>   maintenance: add fetch task
>   maintenance: add loose-objects task
>   maintenance: add pack-files task
>   maintenance: auto-size pack-files batch
>   maintenance: create maintenance.<task>.enabled config
>   maintenance: use pointers to check --auto
>   maintenance: add auto condition for commit-graph task
>   maintenance: create auto condition for loose-objects
>   maintenance: add pack-files auto condition
>   midx: use start_delayed_progress()
> 
>  .gitignore                           |   1 +
>  Documentation/config.txt             |   2 +
>  Documentation/config/maintenance.txt |  32 +
>  Documentation/fetch-options.txt      |   5 +-
>  Documentation/git-clone.txt          |   7 +-
>  Documentation/git-maintenance.txt    | 124 ++++
>  builtin.h                            |   1 +
>  builtin/am.c                         |   2 +-
>  builtin/commit.c                     |   2 +-
>  builtin/fetch.c                      |   6 +-
>  builtin/gc.c                         | 881 +++++++++++++++++++++++++--
>  builtin/merge.c                      |   2 +-
>  builtin/rebase.c                     |   4 +-
>  commit-graph.c                       |   8 +-
>  commit-graph.h                       |   1 +
>  config.c                             |  24 +-
>  config.h                             |   2 +
>  git.c                                |   1 +
>  midx.c                               |  12 +-
>  midx.h                               |   1 +
>  object.h                             |   1 +
>  run-command.c                        |   7 +-
>  run-command.h                        |   2 +-
>  t/t5319-multi-pack-index.sh          |  14 +-
>  t/t5510-fetch.sh                     |   2 +-
>  t/t5514-fetch-multiple.sh            |   2 +-
>  t/t7900-maintenance.sh               | 211 +++++++
>  27 files changed, 1265 insertions(+), 92 deletions(-)
>  create mode 100644 Documentation/config/maintenance.txt
>  create mode 100644 Documentation/git-maintenance.txt
>  create mode 100755 t/t7900-maintenance.sh
> 
> 
> base-commit: 4a0fcf9f760c9774be77f51e1e88a7499b53d2e2
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-671%2Fderrickstolee%2Fmaintenance%2Fgc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-671/derrickstolee/maintenance/gc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/671
> -- 
> gitgitgadget

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-08 23:57 ` Emily Shaffer
@ 2020-07-09 11:21   ` Derrick Stolee
  2020-07-09 12:43     ` Derrick Stolee
  2020-07-09 14:05     ` Junio C Hamano
  0 siblings, 2 replies; 14+ messages in thread
From: Derrick Stolee @ 2020-07-09 11:21 UTC
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/8/2020 7:57 PM, Emily Shaffer wrote:
> On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>> This is a second attempt at redesigning Git's repository maintenance
>> patterns. The first attempt [1] included a way to run jobs in the background
>> using a long-lived process; that idea was rejected and is not included in
>> this series. A future series will use the OS to handle scheduling tasks.
>>
>> [1] 
>> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/
>>
>> As mentioned before, git gc already plays the role of maintaining Git
>> repositories. It has accumulated several smaller pieces in its long history,
>> including:
>>
>>  1. Repacking all reachable objects into one pack-file (and deleting
>>     unreachable objects).
>>  2. Packing refs.
>>  3. Expiring reflogs.
>>  4. Clearing rerere logs.
>>  5. Updating the commit-graph file.
>>
>> While expiring reflogs, clearing rerere logs, and deleting unreachable
>> objects are suitable under the guise of "garbage collection", packing refs
>> and updating the commit-graph file are not as obviously fitting. Further,
>> these operations are "all or nothing" in that they rewrite almost all
>> repository data, which does not perform well at extremely large scales.
>> These operations can also be disruptive to foreground Git commands when git
>> gc --auto triggers during routine use.
>>
>> This series does not intend to change what git gc does, but instead create
>> new choices for automatic maintenance activities, of which git gc remains
>> the only one enabled by default.
>>
>> The new maintenance tasks are:
>>
>>  * 'commit-graph' : write and verify a single layer of an incremental
>>    commit-graph.
>>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>>    a batch of loose objects.
>>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>>    repack using the multi-pack-index's incremental repack strategy.
>>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.
>>
>> These tasks are all disabled by default, but can be enabled with config
>> options or run explicitly using "git maintenance run --task=". There are
>> additional config options to allow customizing the conditions for which the
>> tasks run during the '--auto' option. ('fetch' will never run with the
>> '--auto' option.)
>>
>>  Because 'gc' is implemented as a maintenance task, the most dramatic change
>> of this series is to convert the 'git gc --auto' calls into 'git maintenance
>> run --auto' calls at the end of some Git commands. By default, the only
>> change is that 'git gc --auto' will be run below an additional 'git
>> maintenance' process.
>>
>> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
>> later with subcommands that manage background maintenance, such as 'start',
>> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
>> it is important to focus on the maintenance activities themselves.
>>
>> An expert user could set up scheduled background maintenance themselves with
>> the current series. I have the following crontab data set up to run
>> maintenance on an hourly basis:
>>
>> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log
> 
> One thing I wonder about - now I have to go and make a new crontab
> (which is easy) or Task Scheduler task (which is a pain) for every repo,
> right?
> 
> Is it infeasible to ask for 'git maintenance' to learn something like
> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> config like "maintenance.targetRepo = /<path-to-repo>"?
> 
>>
>> My config includes all tasks except the 'gc' task. The hourly run is
>> over-aggressive, but is sufficient for testing. I'll replace it with daily
>> when I feel satisfied.
>>
>> Hopefully this direction is seen as a positive one. My goal was to add more
>> options for expert users, along with the flexibility to create background
>> maintenance via the OS in a later series.
>>
>> OUTLINE
>> =======
>>
>> Patches 1-4 remove some references to the_repository in builtin/gc.c before
>> we start depending on code in that builtin.
>>
>> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
>> simple shim over 'git gc' and replace calls to 'git gc --auto' from other
>> commands.
> 
> For me, I'd prefer to see 'git maintenance run' get bigger and 'git gc
> --auto' get smaller or disappear. Is there a plan towards that
> direction, or is that out of scope for 'git maintenance run'? Similar
> examples I can think of include 'git annotate' and 'git pickaxe'.

Thanks for these examples of prior work. I'll keep them in mind.

>>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>>    default, but might have different '--auto' conditions and more config
>>    options.
> 
> Like I mentioned above, for me, I'd rather just see the 'gc' builtin go
> away :)

My hope is that we can absolutely do that. I didn't want to start that
exercise yet, as I don't want to disrupt existing workflows more than
I already am.

It is important to recognize that there are already several "tasks" that
run inside 'gc' including:

1. Expiring reflogs.
2. Repacking all reachable objects.
3. Deleting unreachable objects.
4. Packing refs.

Before trying to "remove" the gc builtin, we would want these to be
represented in the 'git maintenance run' as tasks.

In that direction, I realized after submitting that I should rename
the 'pack-files' task in this submission to 'incremental-repack'
instead, allowing a later 'full-repack' task to represent the role
of that step in the 'gc' task. Some users will prefer one over the
other. Perhaps this incremental/full distinction makes it clear that
there are trade-offs in both directions.
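
To make the incremental/full distinction concrete, here is a hedged
sketch built from existing plumbing commands. It approximates the
description of the 'pack-files' task in the cover letter and is not the
code from the patches; the function names, the 'run' helper, and DRY_RUN
are assumptions made for illustration.

```shell
# Print each command; execute it unless DRY_RUN=1 (illustrative helper).
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Incremental: keep existing packs and fold only a batch of them together.
# Usage: incremental_repack <repo> <batch-size>
incremental_repack() {
    repo=$1
    batch_size=$2
    run git -C "$repo" multi-pack-index write
    run git -C "$repo" multi-pack-index expire
    run git -C "$repo" multi-pack-index repack --batch-size="$batch_size"
}

# Full: rewrite all objects into a single pack and delete the old packs.
# Usage: full_repack <repo>
full_repack() {
    run git -C "$1" repack -a -d
}
```

The incremental variant touches less data per run but leaves several
packs behind; the full variant produces one pack at the cost of rewriting
everything.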

Thanks,
-Stolee

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 11:21   ` Derrick Stolee
@ 2020-07-09 12:43     ` Derrick Stolee
  2020-07-09 23:16       ` Jeff King
  2020-07-09 14:05     ` Junio C Hamano
  1 sibling, 1 reply; 14+ messages in thread
From: Derrick Stolee @ 2020-07-09 12:43 UTC
  To: Emily Shaffer, Derrick Stolee via GitGitGadget
  Cc: git, Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/9/2020 7:21 AM, Derrick Stolee wrote:
> On 7/8/2020 7:57 PM, Emily Shaffer wrote:
>> On Tue, Jul 07, 2020 at 02:21:14PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> An expert user could set up scheduled background maintenance themselves with
>>> the current series. I have the following crontab data set up to run
>>> maintenance on an hourly basis:
>>>
>>> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log
>>
>> One thing I wonder about - now I have to go and make a new crontab
>> (which is easy) or Task Scheduler task (which is a pain) for every repo,
>> right?
>>
>> Is it infeasible to ask for 'git maintenance' to learn something like
>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
>> config like "maintenance.targetRepo = /<path-to-repo>"?

Sorry that I missed this comment on my first reply.

The intention is that this cron entry will be simpler after I follow up
with the "background" part of maintenance. The idea is to use global
or system config to register a list of repositories that want background
maintenance and have cron execute something like "git maintenance run --all-repos"
to spawn "git -C <repo> maintenance run --scheduled" for all repos in
the config.

For now, this manual setup does end up a bit cluttered if you have a
lot of repos to maintain.
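
Until such a feature exists, the same effect can be approximated with a
single cron job that loops over a list of repositories. This is only a
sketch under stated assumptions: 'maintain_all', the 'run' helper, and
DRY_RUN are made up for illustration, and the list of repositories is
simply read from standard input (for example from a text file the user
maintains by hand).

```shell
# Print each command; execute it unless DRY_RUN=1 (illustrative helper).
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Read repository paths from stdin, one per line, and run maintenance in each.
maintain_all() {
    while IFS= read -r repo; do
        # Skip blank lines in the list.
        [ -n "$repo" ] || continue
        # Keep the child process from consuming the rest of the list.
        run git -C "$repo" maintenance run --no-quiet </dev/null
    done
}
```

With a file listing one path per line, a single crontab entry calling
'maintain_all <repos.txt' from a small script replaces the per-repository
entries.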

Thanks,
-Stolee


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 11:21   ` Derrick Stolee
  2020-07-09 12:43     ` Derrick Stolee
@ 2020-07-09 14:05     ` Junio C Hamano
  2020-07-09 15:54       ` Derrick Stolee
  1 sibling, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2020-07-09 14:05 UTC
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> It is important to recognize that there are already several "tasks" that
> run inside 'gc' including:
>
> 1. Expiring reflogs.
> 2. Repacking all reachable objects.
> 3. Deleting unreachable objects.
> 4. Packing refs.
>
> Before trying to "remove" the gc builtin, we would want these to be
> represented in the 'git maintenance run' as tasks.

Yup.  I like the overall direction of this approach to (1) have a
single subcommand that helps all the housekeeping tasks, and to (2)
make sure existing housekeeping tasks are supported by the new one.

I can understand why it is tempting to start with a new 'main()'
under a new subcommand name because we expect to add a lot more
tasks, but the name of that subcommand is much less important.

As can be seen in the list you have above, "gc" already does a lot
more than garbage collection (just #3 is the "garbage collection"
proper), as it has grown by following the same approach.

What's more important is (2) above.  While the tool has grown under
the same "gc" name, it was easier to arrange---it fell out naturally
as a consequence of the development being an enhancement on top of
the prior work.  Now that we are reimplementing, we need to actively
care.  As long as we recognize that, I am perfectly happy with the
current effort.

For existing callers, "git gc --auto" may want to be left alive,
merely as a thin wrapper around "git maintenance --auto", and as
long as the latter is done in the same spirit of the former, i.e.
perform a lightweight check to see if the repository is so out of
shape and then do a minimum cleaning, it would be welcomed by users
if it does a lot more than the current "git gc --auto".
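
As an illustration of what such a lightweight check can look like, the
sketch below estimates the number of loose objects by sampling a single
fan-out directory, in the spirit of the loose-object test that
"git gc --auto" performs. It is an approximation written for this
example, not the code in builtin/gc.c; the function name is made up, and
6700 is used as the default because that is the documented default of
gc.auto.

```shell
# Succeed (exit 0) if the repository seems to have too many loose objects.
# Usage: too_many_loose <objects-dir> [threshold]
too_many_loose() {
    objdir=$1
    threshold=${2:-6700}
    # Loose objects are spread over 256 fan-out directories, so counting
    # one of them gives a cheap estimate of the total.
    limit=$(( (threshold + 255) / 256 ))
    count=0
    for f in "$objdir"/17/*; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    [ "$count" -gt "$limit" ]
}
```

A caller would run the expensive cleaning step only when the function
succeeds, which keeps the common case (nothing to do) cheap.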

Thanks.

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 14:05     ` Junio C Hamano
@ 2020-07-09 15:54       ` Derrick Stolee
  2020-07-09 16:26         ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Derrick Stolee @ 2020-07-09 15:54 UTC
  To: Junio C Hamano
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/9/2020 10:05 AM, Junio C Hamano wrote:
> For existing callers, "git gc --auto" may want to be left alive,
> merely as a thin wrapper around "git maintenance --auto", and as
> long as the latter is done in the same spirit of the former, i.e.
> perform a lightweight check to see if the repository is so out of
> shape and then do a minimum cleaning, it would be welcomed by users
> if it does a lot more than the current "git gc --auto".

It's entirely possible that (after the 'maintenance' builtin
stabilizes) we make 'git gc --auto' become an alias of something
like 'git maintenance run --task=gc --auto' (or itemize all of the
sub-tasks) so that 'git gc --auto' doesn't change behavior.

That's a big motivation for adding all code into builtin/gc.c so
we can access these tasks inside GC without needing to move or
copy the code. I'm trying to preserve history as much as possible.

Thanks,
-Stolee


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 15:54       ` Derrick Stolee
@ 2020-07-09 16:26         ` Junio C Hamano
  2020-07-09 16:56           ` Derrick Stolee
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2020-07-09 16:26 UTC
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 7/9/2020 10:05 AM, Junio C Hamano wrote:
>> For existing callers, "git gc --auto" may want to be left alive,
>> merely as a thin wrapper around "git maintenance --auto", and as
>> long as the latter is done in the same spirit of the former, i.e.
>> perform a lightweight check to see if the repository is so out of
>> shape and then do a minimum cleaning, it would be welcomed by users
>> if it does a lot more than the current "git gc --auto".
>
> It's entirely possible that (after the 'maintenance' builtin
> stabilizes) we make 'git gc --auto' become an alias of something
> like 'git maintenance run --task=gc --auto' (or itemize all of the
> sub-tasks) so that 'git gc --auto' doesn't change behavior.

Yes, it is possible, but I doubt it is desirable.

The current users of "gc --auto" do not (and should not) care about the
details of what tasks are performed.  We surely have added more
stuff that needs maintenance since "gc --auto" was originally
written, and after people have started using "gc --auto" in their
workflows.  For example, I think "gc --auto" predates "rerere gc"
and those who had "gc --auto" in their script had a moment when
suddenly it started to clean stale entries in the rerere database.

Did they get upset when it happened?  Will they get upset when it
starts cleaning up stale commit-graph leftover files?

As long as "gc --auto" kept the same spirit of doing a lightweight
check to see if the repository is so out of shape to require
cleaning and performing a minimum maintenance when it started
calling "rerere gc", and as long as "maintenance --auto" does the
same, I would think the users would be delighted without complaints.

So, I wouldn't worry too much about what exactly happens with the
future versions of "gc --auto".  The world has changed, and we have
more items in the repository that need maintenance/cruft removal.
The command in the new world should deal with this new stuff, too.

Thanks.

* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 16:26         ` Junio C Hamano
@ 2020-07-09 16:56           ` Derrick Stolee
  0 siblings, 0 replies; 14+ messages in thread
From: Derrick Stolee @ 2020-07-09 16:56 UTC
  To: Junio C Hamano
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, peff,
	congdanhqx, phillip.wood123, Derrick Stolee

On 7/9/2020 12:26 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/9/2020 10:05 AM, Junio C Hamano wrote:
>>> For existing callers, "git gc --auto" may want to be left alive,
>>> merely as a thin wrapper around "git maintenance --auto", and as
>>> long as the latter is done in the same spirit of the former, i.e.
>>> perform a lightweight check to see if the repository is so out of
>>> shape and then do a minimum cleaning, it would be welcomed by users
>>> if it does a lot more than the current "git gc --auto".
>>
>> It's entirely possible that (after the 'maintenance' builtin
>> stabilizes) we make 'git gc --auto' become an alias of something
>> like 'git maintenance run --task=gc --auto' (or itemize all of the
>> sub-tasks) so that 'git gc --auto' doesn't change behavior.
> 
> Yes, it is possible, but I doubt it is desirable.
> 
> The current users of "gc --auto" do not (and should not) care about
> the details of what tasks are performed.  We surely have added more
> stuff that needs maintenance since "gc --auto" was originally
> written, and after people have started using "gc --auto" in their
> workflows.  For example, I think "gc --auto" predates "rerere gc"
> and those who had "gc --auto" in their script had a moment when
> suddenly it started to clean stale entries in the rerere database.
> 
> Did they get upset when it happened?  Will they get upset when it
> starts cleaning up stale commit-graph leftover files?
> 
> As long as "gc --auto" kept the same spirit of doing a lightweight
> check to see if the repository is so out of shape as to require
> cleaning, and of performing only minimal maintenance, when it started
> calling "rerere gc", and as long as "maintenance --auto" does the
> same, I would think the users would be delighted, without complaints.
> 
> So, I wouldn't worry too much about what exactly happens with the
> future versions of "gc --auto".  The world has changed, and we have
> more items in the repository that need maintenance/cruft removal.
> The command in the new world should deal with this new stuff, too.

Sounds good to me. The extra context around this helps a lot!

-Stolee


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 12:43     ` Derrick Stolee
@ 2020-07-09 23:16       ` Jeff King
  2020-07-09 23:45         ` Derrick Stolee
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2020-07-09 23:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee

On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:

> >> Is it infeasible to ask for 'git maintenance' to learn something like
> >> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> >> config like "maintenance.targetRepo = /<path-to-repo>"?
> 
> Sorry that I missed this comment on my first reply.
> 
> The intention is that this cron entry will be simpler after I follow up
> with the "background" part of maintenance. The idea is to use global
> or system config to register a list of repositories that want background
> maintenance and have cron execute something like "git maintenance run --all-repos"
> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
> the config.
> 
> For now, this manual setup does end up a bit cluttered if you have a
> lot of repos to maintain.

I think it might be useful to have a general command to run a subcommand
in a bunch of repositories. Something like:

  git for-each-repo --recurse /path/to/repos git maintenance ...

which would root around in /path/to/repos for any git-dirs and run "git
--git-dir=$GIT_DIR maintenance ..." on each of them.
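
The discovery half of that could be approximated today with find(1).
This is only a sketch: "find_repos" is a made-up name, and it only
notices non-bare repositories (a ".git" directory), so bare repos and
worktrees whose ".git" is a file are missed.

```shell
#!/bin/sh
# Sketch only: print the top-level directory of every non-bare repository
# found below the directory given as $1.
find_repos() {
    # -prune keeps find from descending into each .git directory it matches;
    # dirname turns ".../repo/.git" into ".../repo".
    find "$1" -type d -name .git -prune -exec dirname {} \;
}
```

Each printed path could then be handed to "git -C <path> maintenance ...".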

And/or:

  git for-each-repo --config maintenance.repos git maintenance ...

which would pull the set of repos from the named config variable instead
of looking around the filesystem.

You could use either as a one-liner in the crontab (depending on which
is easier with your repo layout).
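
Until such a command exists, the config-driven variant can be faked
with a few lines of shell.  A sketch, assuming a multi-valued
"maintenance.repos" key as discussed above; "run_in_each_repo" is a
made-up helper name, not a Git command:

```shell
#!/bin/sh
# Sketch only: run the given command inside every repository path read
# from stdin (one path per line), continuing past failures.
run_in_each_repo() {
    status=0
    while IFS= read -r repo; do
        # Skip blank lines in the input list.
        [ -n "$repo" ] || continue
        # Use a subshell so the caller's working directory is untouched;
        # </dev/null stops the command from eating the rest of the list.
        if ! ( cd "$repo" && "$@" ) </dev/null; then
            echo "warning: command failed in $repo" >&2
            status=1
        fi
    done
    return "$status"
}

# Example use from a crontab wrapper script (the key name is an assumption):
#   git config --global --get-all maintenance.repos |
#       run_in_each_repo git maintenance run
```

Running the repositories one after another, rather than in parallel,
also keeps the number of background processes bounded.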

-Peff


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 23:16       ` Jeff King
@ 2020-07-09 23:45         ` Derrick Stolee
  2020-07-10 18:46           ` Emily Shaffer
  0 siblings, 1 reply; 14+ messages in thread
From: Derrick Stolee @ 2020-07-09 23:45 UTC (permalink / raw)
  To: Jeff King
  Cc: Emily Shaffer, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee

On 7/9/2020 7:16 PM, Jeff King wrote:
> On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:
> 
>>>> Is it infeasible to ask for 'git maintenance' to learn something like
>>>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
>>>> config like "maintenance.targetRepo = /<path-to-repo>"?
>>
>> Sorry that I missed this comment on my first reply.
>>
>> The intention is that this cron entry will be simpler after I follow up
>> with the "background" part of maintenance. The idea is to use global
>> or system config to register a list of repositories that want background
>> maintenance and have cron execute something like "git maintenance run --all-repos"
>> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
>> the config.
>>
>> For now, this manual setup does end up a bit cluttered if you have a
>> lot of repos to maintain.
> 
> I think it might be useful to have a general command to run a subcommand
> in a bunch of repositories. Something like:
> 
>   git for-each-repo --recurse /path/to/repos git maintenance ...
> 
> which would root around in /path/to/repos for any git-dirs and run "git
> --git-dir=$GIT_DIR maintenance ..." on each of them.
> 
> And/or:
> 
>   git for-each-repo --config maintenance.repos git maintenance ...
> 
> which would pull the set of repos from the named config variable instead
> of looking around the filesystem.

Yes! This! That's a good way to make something generic that solves
the problem at hand, but might also have other applications! Most
excellent.

> You could use either as a one-liner in the crontab (depending on which
> is easier with your repo layout).

The hope is that we can have such a clean layout. I'm particularly
fond of the config option because users may want to opt-in to
background maintenance only on some repos, even if they put them
in a consistent location.

In the _far_ future, we might even want to add a repo to this
"maintenance.repos" list during 'git init' and 'git clone' so
this is automatic. It then becomes opt-out at that point, which
is why I say the _far, far_ future.

Thanks,
-Stolee


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-09 23:45         ` Derrick Stolee
@ 2020-07-10 18:46           ` Emily Shaffer
  2020-07-10 19:30             ` Son Luong Ngoc
  0 siblings, 1 reply; 14+ messages in thread
From: Emily Shaffer @ 2020-07-10 18:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Jeff King, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee

On Thu, Jul 09, 2020 at 07:45:47PM -0400, Derrick Stolee wrote:
> 
> On 7/9/2020 7:16 PM, Jeff King wrote:
> > On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:
> > 
> >>>> Is it infeasible to ask for 'git maintenance' to learn something like
> >>>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
> >>>> config like "maintenance.targetRepo = /<path-to-repo>"?
> >>
> >> Sorry that I missed this comment on my first reply.
> >>
> >> The intention is that this cron entry will be simpler after I follow up
> >> with the "background" part of maintenance. The idea is to use global
> >> or system config to register a list of repositories that want background
> >> maintenance and have cron execute something like "git maintenance run --all-repos"
> >> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
> >> the config.
> >>
> >> For now, this manual setup does end up a bit cluttered if you have a
> >> lot of repos to maintain.
> > 
> > I think it might be useful to have a general command to run a subcommand
> > in a bunch of repositories. Something like:
> > 
> >   git for-each-repo --recurse /path/to/repos git maintenance ...
> > 
> > which would root around in /path/to/repos for any git-dirs and run "git
> > --git-dir=$GIT_DIR maintenance ..." on each of them.
> > 
> > And/or:
> > 
> >   git for-each-repo --config maintenance.repos git maintenance ...
> > 
> > which would pull the set of repos from the named config variable instead
> > of looking around the filesystem.
> 
> Yes! This! That's a good way to make something generic that solves
> the problem at hand, but might also have other applications! Most
> excellent.

I'm glad I wasn't the only one super geeked when I read this idea. I'd
use the heck out of this in my .bashrc too. Sounds awesome. I actually
had a short-lived fling last year with a script to summarize my
uncommitted changes in all repos at the beginning of every session
(dropped because it became one more thing to gloss over) and could have
really used this command.

> 
> > You could use either as a one-liner in the crontab (depending on which
> > is easier with your repo layout).
> 
> The hope is that we can have such a clean layout. I'm particularly
> fond of the config option because users may want to opt-in to
> background maintenance only on some repos, even if they put them
> in a consistent location.
> 
> In the _far_ future, we might even want to add a repo to this
> "maintenance.repos" list during 'git init' and 'git clone' so
> this is automatic. It then becomes opt-out at that point, which
> is why I say the _far, far_ future.

Oh, I like this idea a lot. Then I can do something silly like

  alias reproclone="git clone --no-maintenance"

and get the benefits on everything else that I plan to be using
frequently.

 - Emily


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-10 18:46           ` Emily Shaffer
@ 2020-07-10 19:30             ` Son Luong Ngoc
  0 siblings, 0 replies; 14+ messages in thread
From: Son Luong Ngoc @ 2020-07-10 19:30 UTC (permalink / raw)
  To: Emily Shaffer
  Cc: Derrick Stolee, Jeff King, Derrick Stolee via GitGitGadget, git,
	Johannes.Schindelin, sandals, steadmon, jrnieder, congdanhqx,
	phillip.wood123, Derrick Stolee



> On Jul 10, 2020, at 20:46, Emily Shaffer <emilyshaffer@google.com> wrote:
> 
> On Thu, Jul 09, 2020 at 07:45:47PM -0400, Derrick Stolee wrote:
>> 
>> On 7/9/2020 7:16 PM, Jeff King wrote:
>>> On Thu, Jul 09, 2020 at 08:43:48AM -0400, Derrick Stolee wrote:
>>> 
>>>>>> Is it infeasible to ask for 'git maintenance' to learn something like
>>>>>> '--on /<path-to-repo> --on /<path-to-second-repo>'? Or better yet, some
>>>>>> config like "maintenance.targetRepo = /<path-to-repo>"?
>>>> 
>>>> Sorry that I missed this comment on my first reply.
>>>> 
>>>> The intention is that this cron entry will be simpler after I follow up
>>>> with the "background" part of maintenance. The idea is to use global
>>>> or system config to register a list of repositories that want background
>>>> maintenance and have cron execute something like "git maintenance run --all-repos"
>>>> to spawn "git -C <repo> maintenance run --scheduled" for all repos in
>>>> the config.
>>>> 
>>>> For now, this manual setup does end up a bit cluttered if you have a
>>>> lot of repos to maintain.
>>> 
>>> I think it might be useful to have a general command to run a subcommand
>>> in a bunch of repositories. Something like:
>>> 
>>>  git for-each-repo --recurse /path/to/repos git maintenance ...
>>> 
>>> which would root around in /path/to/repos for any git-dirs and run "git
>>> --git-dir=$GIT_DIR maintenance ..." on each of them.
>>> 
>>> And/or:
>>> 
>>>  git for-each-repo --config maintenance.repos git maintenance ...
>>> 
>>> which would pull the set of repos from the named config variable instead
>>> of looking around the filesystem.
>> 
>> Yes! This! That's a good way to make something generic that solves
>> the problem at hand, but might also have other applications! Most
>> excellent.
> 
> I'm glad I wasn't the only one super geeked when I read this idea. I'd
> use the heck out of this in my .bashrc too. Sounds awesome. I actually
> had a short-lived fling last year with a script to summarize my
> uncommitted changes in all repos at the beginning of every session
> (dropped because it became one more thing to gloss over) and could have
> really used this command.

I was planning to build a CLI tool that helps manage maintenance across
multiple repos, much like what was just described here.
My experience with my poor-man-scalar [1] bash script is that, with multiple
repositories, the process count can get out of control quite quickly, and there
are probably other issues that I have not thought of or encountered yet.

There is definitely a need to keep all the repos up to date, with prefetched
objects and a fresh commit-graph, while keeping them compact and garbage-free.
Having this in Git would simplify a lot of daily operations for end users.

> 
>> 
>>> You could use either as a one-liner in the crontab (depending on which
>>> is easier with your repo layout).
>> 
>> The hope is that we can have such a clean layout. I'm particularly
>> fond of the config option because users may want to opt-in to
>> background maintenance only on some repos, even if they put them
>> in a consistent location.
>> 
>> In the _far_ future, we might even want to add a repo to this
>> "maintenance.repos" list during 'git init' and 'git clone' so
>> this is automatic. It then becomes opt-out at that point, which
>> is why I say the _far, far_ future.
> 
> Oh, I like this idea a lot. Then I can do something silly like
> 
>  alias reproclone="git clone --no-maintenance"
> 
> and get the benefits on everything else that I plan to be using
> frequently.

This is starting to remind me of automatic updates in some popular operating
systems, where the download, installation, and cleanup of updates for multiple
software components are managed by a single tool.

I wonder if this is the path git should take in the 'new world' that Junio mentioned. [2]

But I am also super geeked reading this. :)

> 
> - Emily

Regards,
Son Luong.

[1]: https://github.com/sluongng/git-care
[2]: https://lore.kernel.org/git/xmqqmu48y7rw.fsf@gitster.c.googlers.com/


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
@ 2020-07-13  6:18 Son Luong Ngoc
  2020-07-14 13:46 ` Derrick Stolee
  0 siblings, 1 reply; 14+ messages in thread
From: Son Luong Ngoc @ 2020-07-13  6:18 UTC (permalink / raw)
  To: gitgitgadget
  Cc: Johannes.Schindelin, congdanhqx, derrickstolee, git, jrnieder,
	peff, phillip.wood123, sandals, steadmon

Hi Derrick,

> This is a second attempt at redesigning Git's repository maintenance
> patterns. The first attempt [1] included a way to run jobs in the background
> using a long-lived process; that idea was rejected and is not included in
> this series. A future series will use the OS to handle scheduling tasks.
>
> [1]
> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/
>
> As mentioned before, git gc already plays the role of maintaining Git
> repositories. It has accumulated several smaller pieces in its long history,
> including:
>
>  1. Repacking all reachable objects into one pack-file (and deleting
>     unreachable objects).
>  2. Packing refs.
>  3. Expiring reflogs.
>  4. Clearing rerere logs.
>  5. Updating the commit-graph file.

It's worth mentioning 'git worktree prune' as well.

>
> While expiring reflogs, clearing rerere logs, and deleting unreachable
> objects are suitable under the guise of "garbage collection", packing refs
> and updating the commit-graph file are not as obviously fitting. Further,
> these operations are "all or nothing" in that they rewrite almost all
> repository data, which does not perform well at extremely large scales.
> These operations can also be disruptive to foreground Git commands when git
> gc --auto triggers during routine use.
>
> This series does not intend to change what git gc does, but instead create
> new choices for automatic maintenance activities, of which git gc remains
> the only one enabled by default.
>
> The new maintenance tasks are:
>
>  * 'commit-graph' : write and verify a single layer of an incremental
>    commit-graph.
>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>    a batch of loose objects.
>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>    repack using the multi-pack-index's incremental repack strategy.
>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.

As some of the previous discussions [1] have raised, I think 'prefetch' would
communicate the refs' purpose better than just 'hidden'.
In fact, I would suggest naming the task 'prefetch' instead, just to avoid
potential confusion between 'git fetch' and 'git maintenance fetch'.

[1]: https://lore.kernel.org/git/xmqqeet1y8wy.fsf@gitster.c.googlers.com/

>
> These tasks are all disabled by default, but can be enabled with config
> options or run explicitly using "git maintenance run --task=". There are
> additional config options to allow customizing the conditions for which the
> tasks run during the '--auto' option. ('fetch' will never run with the
> '--auto' option.)
>
>  Because 'gc' is implemented as a maintenance task, the most dramatic change
> of this series is to convert the 'git gc --auto' calls into 'git maintenance
> run --auto' calls at the end of some Git commands. By default, the only
> change is that 'git gc --auto' will be run below an additional 'git
> maintenance' process.
>
> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
> later with subcommands that manage background maintenance, such as 'start',
> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
> it is important to focus on the maintenance activities themselves.
>
> An expert user could set up scheduled background maintenance themselves with
> the current series. I have the following crontab data set up to run
> maintenance on an hourly basis:
>
> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log

Perhaps the logging should be included inside the maintenance command instead
of relying on the append here?
Given that we have 'gc.log', I would imagine 'maintenance.log' is not
too far-fetched?
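
Until that exists, a small wrapper is one way to get similar logging from
cron without repeating the redirection in every entry. This is a sketch only;
"run_logged" and the log layout are made up here, not anything from the series:

```shell
#!/bin/sh
# Sketch only: run a command, appending a timestamped header, the command's
# stdout and stderr, and its exit status to the log file given as $1.
run_logged() {
    log=$1
    shift
    rc=0
    {
        printf '=== %s: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*"
        # Remember the exit status without aborting the wrapper.
        "$@" || rc=$?
        printf '=== exit status %s\n' "$rc"
    } >>"$log" 2>&1
    return "$rc"
}

# Example crontab command, using the same placeholder path as above:
#   run_logged /<path-to-repo>/.git/maintenance.log \
#       git -C /<path-to-repo> maintenance run --no-quiet
```

Capturing stderr and the exit status in the same file makes a failed hourly
run easier to spot than a plain '>>' append of stdout.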

>
> My config includes all tasks except the 'gc' task. The hourly run is
> over-aggressive, but is sufficient for testing. I'll replace it with daily
> when I feel satisfied.
>
> Hopefully this direction is seen as a positive one. My goal was to add more
> options for expert users, along with the flexibility to create background
> maintenance via the OS in a later series.
>
> OUTLINE
> =======
>
> Patches 1-4 remove some references to the_repository in builtin/gc.c before
> we start depending on code in that builtin.
>
> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
> simple shim over 'git gc' and replaces calls to 'git gc --auto' from other
> commands.
>
> Patches 8-15 create new maintenance tasks. These are the same tasks sent in
> the previous RFC.
>
> Patches 16-21 create more customization through config and perform other
> polish items.
>
> FUTURE WORK
> ===========
>
>  * Add 'start', 'stop', and 'schedule' subcommands to initialize the
>    commands run in the background.
>
>
>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>    default, but might have different '--auto' conditions and more config
>    options.
>
>
>  * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
>    with use of the 'commit-graph' task.
>
>
>
> Thanks, -Stolee

Thanks,
Son Luong.


* Re: [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization
  2020-07-13  6:18 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Son Luong Ngoc
@ 2020-07-14 13:46 ` Derrick Stolee
  0 siblings, 0 replies; 14+ messages in thread
From: Derrick Stolee @ 2020-07-14 13:46 UTC (permalink / raw)
  To: Son Luong Ngoc, gitgitgadget
  Cc: Johannes.Schindelin, congdanhqx, derrickstolee, git, jrnieder,
	peff, phillip.wood123, sandals, steadmon

On 7/13/2020 2:18 AM, Son Luong Ngoc wrote:
> Hi Derrick,
> 
>> This is a second attempt at redesigning Git's repository maintenance
>> patterns. The first attempt [1] included a way to run jobs in the background
>> using a long-lived process; that idea was rejected and is not included in
>> this series. A future series will use the OS to handle scheduling tasks.
>>
>> [1]
>> https://lore.kernel.org/git/pull.597.git.1585946894.gitgitgadget@gmail.com/
>>
>> As mentioned before, git gc already plays the role of maintaining Git
>> repositories. It has accumulated several smaller pieces in its long history,
>> including:
>>
>>  1. Repacking all reachable objects into one pack-file (and deleting
>>     unreachable objects).
>>  2. Packing refs.
>>  3. Expiring reflogs.
>>  4. Clearing rerere logs.
>>  5. Updating the commit-graph file.
> 
> It's worth mentioning 'git worktree prune' as well.

Good point. I'll also say "including, but not limited to:"

>> While expiring reflogs, clearing rerere logs, and deleting unreachable
>> objects are suitable under the guise of "garbage collection", packing refs
>> and updating the commit-graph file are not as obviously fitting. Further,
>> these operations are "all or nothing" in that they rewrite almost all
>> repository data, which does not perform well at extremely large scales.
>> These operations can also be disruptive to foreground Git commands when git
>> gc --auto triggers during routine use.
>>
>> This series does not intend to change what git gc does, but instead create
>> new choices for automatic maintenance activities, of which git gc remains
>> the only one enabled by default.
>>
>> The new maintenance tasks are:
>>
>>  * 'commit-graph' : write and verify a single layer of an incremental
>>    commit-graph.
>>  * 'loose-objects' : prune packed loose objects, then create a new pack from
>>    a batch of loose objects.
>>  * 'pack-files' : expire redundant packs from the multi-pack-index, then
>>    repack using the multi-pack-index's incremental repack strategy.
>>  * 'fetch' : fetch from each remote, storing the refs in 'refs/hidden//'.
> 
> As some of the previous discussions [1] have raised, I think 'prefetch' would
> communicate the refs' purpose better than just 'hidden'.
> In fact, I would suggest naming the task 'prefetch' instead, just to avoid
> potential confusion between 'git fetch' and 'git maintenance fetch'.
> 
> [1]: https://lore.kernel.org/git/xmqqeet1y8wy.fsf@gitster.c.googlers.com/

Thanks for the reminder. I'll rename the task as you suggest.

>> These tasks are all disabled by default, but can be enabled with config
>> options or run explicitly using "git maintenance run --task=". There are
>> additional config options to allow customizing the conditions for which the
>> tasks run during the '--auto' option. ('fetch' will never run with the
>> '--auto' option.)
>>
>>  Because 'gc' is implemented as a maintenance task, the most dramatic change
>> of this series is to convert the 'git gc --auto' calls into 'git maintenance
>> run --auto' calls at the end of some Git commands. By default, the only
>> change is that 'git gc --auto' will be run below an additional 'git
>> maintenance' process.
>>
>> The 'git maintenance' builtin has a 'run' subcommand so it can be extended
>> later with subcommands that manage background maintenance, such as 'start',
>> 'stop', 'pause', or 'schedule'. These are not the subject of this series, as
>> it is important to focus on the maintenance activities themselves.
>>
>> An expert user could set up scheduled background maintenance themselves with
>> the current series. I have the following crontab data set up to run
>> maintenance on an hourly basis:
>>
>> 0 * * * * git -C /<path-to-repo> maintenance run --no-quiet >>/<path-to-repo>/.git/maintenance.log
> 
> Perhaps the logging should be included inside the maintenance command instead
> of relying on the append here?
> Given that we have 'gc.log', I would imagine 'maintenance.log' is not
> too far-fetched?

I'll research gc.log and how that works.

>> My config includes all tasks except the 'gc' task. The hourly run is
>> over-aggressive, but is sufficient for testing. I'll replace it with daily
>> when I feel satisfied.
>>
>> Hopefully this direction is seen as a positive one. My goal was to add more
>> options for expert users, along with the flexibility to create background
>> maintenance via the OS in a later series.
>>
>> OUTLINE
>> =======
>>
>> Patches 1-4 remove some references to the_repository in builtin/gc.c before
>> we start depending on code in that builtin.
>>
>> Patches 5-7 create the 'git maintenance run' builtin and subcommand as a
>> simple shim over 'git gc' and replaces calls to 'git gc --auto' from other
>> commands.
>>
>> Patches 8-15 create new maintenance tasks. These are the same tasks sent in
>> the previous RFC.
>>
>> Patches 16-21 create more customization through config and perform other
>> polish items.
>>
>> FUTURE WORK
>> ===========
>>
>>  * Add 'start', 'stop', and 'schedule' subcommands to initialize the
>>    commands run in the background.
>>
>>
>>  * Split the 'gc' builtin into smaller maintenance tasks that are enabled by
>>    default, but might have different '--auto' conditions and more config
>>    options.
>>
>>
>>  * Replace config like 'gc.writeCommitGraph' and 'fetch.writeCommitGraph'
>>    with use of the 'commit-graph' task.
>>
>>
>>
>> Thanks, -Stolee
> 
> Thanks,
> Son Luong.

Thank you!
-Stolee



end of thread, other threads:[~2020-07-14 13:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-13  6:18 [PATCH 00/21] Maintenance builtin, allowing 'gc --auto' customization Son Luong Ngoc
2020-07-14 13:46 ` Derrick Stolee
  -- strict thread matches above, loose matches on Subject: below --
2020-07-07 14:21 Derrick Stolee via GitGitGadget
2020-07-08 23:57 ` Emily Shaffer
2020-07-09 11:21   ` Derrick Stolee
2020-07-09 12:43     ` Derrick Stolee
2020-07-09 23:16       ` Jeff King
2020-07-09 23:45         ` Derrick Stolee
2020-07-10 18:46           ` Emily Shaffer
2020-07-10 19:30             ` Son Luong Ngoc
2020-07-09 14:05     ` Junio C Hamano
2020-07-09 15:54       ` Derrick Stolee
2020-07-09 16:26         ` Junio C Hamano
2020-07-09 16:56           ` Derrick Stolee
