From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: sandals@crustytoothpaste.net, steadmon@google.com,
jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com,
phillip.wood123@gmail.com, emilyshaffer@google.com,
sluongng@gmail.com, jonathantanmy@google.com,
Jonathan Tan <jonathantanmy@google.com>,
Derrick Stolee <stolee@gmail.com>,
Derrick Stolee <derrickstolee@github.com>,
Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v4 2/8] maintenance: add loose-objects task
Date: Fri, 25 Sep 2020 12:33:32 +0000
Message-ID: <f3a16fd324a6b74ce93e8b764553ff7f4705b42e.1601037218.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.696.v4.git.1601037218.gitgitgadget@gmail.com>
From: Derrick Stolee <dstolee@microsoft.com>

One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file.

This step is based on a similar step in Scalar [1] and VFS for Git.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
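
The batching in step 2 of the commit message relies on an iterator
that stops as soon as its callback returns nonzero. Below is a
minimal standalone sketch of that pattern, not the code in this
patch. The names for_each_object(), bail_on_first(), write_to_batch()
and write_batch() are hypothetical, and a plain array of strings
stands in for the loose object directories.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for Git's loose-object callback type. */
typedef int (*each_object_fn)(const char *hex, void *data);

/*
 * Hypothetical stand-in for for_each_loose_file_in_objdir(): visit
 * each name in order and stop as soon as the callback returns
 * nonzero, passing that value back to the caller.
 */
static int for_each_object(const char **hex, size_t nr,
			   each_object_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(hex[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

struct batch_data {
	FILE *out;
	int count;
	int batch_size;
};

/* Returning nonzero on the first name answers "is there anything?". */
static int bail_on_first(const char *hex, void *data)
{
	(void)hex;
	(void)data;
	return 1;
}

/* Emit one name, then ask the iterator to stop once the batch is full. */
static int write_to_batch(const char *hex, void *data)
{
	struct batch_data *d = data;

	fprintf(d->out, "%s\n", hex);
	return ++(d->count) >= d->batch_size;
}

/* Returns the number of names written, or 0 if there were none. */
static int write_batch(const char **hex, size_t nr, FILE *out,
		       int batch_size)
{
	struct batch_data d = { out, 0, batch_size };

	/* Cheap emptiness probe: skip the real work if nothing exists. */
	if (!for_each_object(hex, nr, bail_on_first, NULL))
		return 0;

	for_each_object(hex, nr, write_to_batch, &d);
	return d.count;
}
```

Given four names and a batch size of 3, write_batch() writes the
first three names and returns 3. The sketch compares with '>=' so it
stops at exactly batch_size names; the callback in the patch below
compares with '>', which, as I read it, lets one more name through
before the iteration stops.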
Documentation/git-maintenance.txt | 15 +++++
builtin/gc.c | 97 +++++++++++++++++++++++++++++++
t/t7900-maintenance.sh | 39 +++++++++++++
3 files changed, 151 insertions(+)
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index 12668fccf7..fc95eb594f 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -70,6 +70,21 @@ gc::
be disruptive in some situations, as it deletes stale data. See
linkgit:git-gc[1] for more details on garbage collection in Git.
+loose-objects::
+ The `loose-objects` job cleans up loose objects and places them into
+ pack-files. In order to prevent race conditions with concurrent Git
+ commands, it follows a two-step process. First, it deletes any loose
+ objects that already exist in a pack-file; concurrent Git processes
+ will examine the pack-file for the object data instead of the loose
+ object. Second, it creates a new pack-file (starting with "loose-")
+ containing a batch of loose objects. The batch size is limited to 50
+ thousand objects to prevent the job from taking too long on a
+ repository with many loose objects. The `gc` task writes unreachable
+ objects as loose objects to be cleaned up by a later step only if
+ they are not re-added to a pack-file; for this reason it is not
+ advisable to enable both the `loose-objects` and `gc` tasks at the
+ same time.
+
OPTIONS
-------
--auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 5e469488f4..c9db8555b9 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -880,6 +880,98 @@ static int maintenance_task_gc(struct maintenance_run_opts *opts)
return run_command(&child);
}
+static int prune_packed(struct maintenance_run_opts *opts)
+{
+ struct child_process child = CHILD_PROCESS_INIT;
+
+ child.git_cmd = 1;
+ strvec_push(&child.args, "prune-packed");
+
+ if (opts->quiet)
+ strvec_push(&child.args, "--quiet");
+
+ return !!run_command(&child);
+}
+
+struct write_loose_object_data {
+ FILE *in;
+ int count;
+ int batch_size;
+};
+
+static int bail_on_loose(const struct object_id *oid,
+ const char *path,
+ void *data)
+{
+ return 1;
+}
+
+static int write_loose_object_to_stdin(const struct object_id *oid,
+ const char *path,
+ void *data)
+{
+ struct write_loose_object_data *d = (struct write_loose_object_data *)data;
+
+ fprintf(d->in, "%s\n", oid_to_hex(oid));
+
+ return ++(d->count) > d->batch_size;
+}
+
+static int pack_loose(struct maintenance_run_opts *opts)
+{
+ struct repository *r = the_repository;
+ int result = 0;
+ struct write_loose_object_data data;
+ struct child_process pack_proc = CHILD_PROCESS_INIT;
+
+ /*
+ * Do not start pack-objects process
+ * if there are no loose objects.
+ */
+ if (!for_each_loose_file_in_objdir(r->objects->odb->path,
+ bail_on_loose,
+ NULL, NULL, NULL))
+ return 0;
+
+ pack_proc.git_cmd = 1;
+
+ strvec_push(&pack_proc.args, "pack-objects");
+ if (opts->quiet)
+ strvec_push(&pack_proc.args, "--quiet");
+ strvec_pushf(&pack_proc.args, "%s/pack/loose", r->objects->odb->path);
+
+ pack_proc.in = -1;
+
+ if (start_command(&pack_proc)) {
+ error(_("failed to start 'git pack-objects' process"));
+ return 1;
+ }
+
+ data.in = xfdopen(pack_proc.in, "w");
+ data.count = 0;
+ data.batch_size = 50000;
+
+ for_each_loose_file_in_objdir(r->objects->odb->path,
+ write_loose_object_to_stdin,
+ NULL,
+ NULL,
+ &data);
+
+ fclose(data.in);
+
+ if (finish_command(&pack_proc)) {
+ error(_("failed to finish 'git pack-objects' process"));
+ result = 1;
+ }
+
+ return result;
+}
+
+static int maintenance_task_loose_objects(struct maintenance_run_opts *opts)
+{
+ return prune_packed(opts) || pack_loose(opts);
+}
+
typedef int maintenance_task_fn(struct maintenance_run_opts *opts);
/*
@@ -901,6 +993,7 @@ struct maintenance_task {
enum maintenance_task_label {
TASK_PREFETCH,
+ TASK_LOOSE_OBJECTS,
TASK_GC,
TASK_COMMIT_GRAPH,
@@ -913,6 +1006,10 @@ static struct maintenance_task tasks[] = {
"prefetch",
maintenance_task_prefetch,
},
+ [TASK_LOOSE_OBJECTS] = {
+ "loose-objects",
+ maintenance_task_loose_objects,
+ },
[TASK_GC] = {
"gc",
maintenance_task_gc,
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index 045524e6ad..b3fc7c8670 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -88,4 +88,43 @@ test_expect_success 'prefetch multiple remotes' '
test_cmp_rev refs/remotes/remote2/two refs/prefetch/remote2/two
'
+test_expect_success 'loose-objects task' '
+ # Repack everything so we know the state of the object dir
+ git repack -adk &&
+
+ # Hack to stop maintenance from running during "git commit"
+ echo in use >.git/objects/maintenance.lock &&
+
+ # Assuming that "git commit" creates at least one loose object
+ test_commit create-loose-object &&
+ rm .git/objects/maintenance.lock &&
+
+ ls .git/objects >obj-dir-before &&
+ test_file_not_empty obj-dir-before &&
+ ls .git/objects/pack/*.pack >packs-before &&
+ test_line_count = 1 packs-before &&
+
+ # The first run creates a pack-file
+ # but does not delete loose objects.
+ git maintenance run --task=loose-objects &&
+ ls .git/objects >obj-dir-between &&
+ test_cmp obj-dir-before obj-dir-between &&
+ ls .git/objects/pack/*.pack >packs-between &&
+ test_line_count = 2 packs-between &&
+ ls .git/objects/pack/loose-*.pack >loose-packs &&
+ test_line_count = 1 loose-packs &&
+
+ # The second run deletes loose objects
+ # but does not create a pack-file.
+ git maintenance run --task=loose-objects &&
+ ls .git/objects >obj-dir-after &&
+ cat >expect <<-\EOF &&
+ info
+ pack
+ EOF
+ test_cmp expect obj-dir-after &&
+ ls .git/objects/pack/*.pack >packs-after &&
+ test_cmp packs-between packs-after
+'
+
test_done
--
gitgitgadget