From: Jeff King <peff@peff.net>
To: Rafael Silva <rafaeloliveira.cs@gmail.com>
Cc: "Jonathan Tan" <jonathantanmy@google.com>,
"SZEDER Gábor" <szeder.dev@gmail.com>,
"Git Mailing List" <git@vger.kernel.org>
Subject: Re: rather slow 'git repack' in 'blob:none' partial clones
Date: Wed, 7 Apr 2021 17:17:50 -0400 [thread overview]
Message-ID: <YG4hfge2y/AmcklZ@coredump.intra.peff.net> (raw)
In-Reply-To: <gohp6ko8et3jdm.fsf@cpm12071.fritz.box>
On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> When I was digging into the code and adding trace2_region_*() calls, I
> notice most of the time spent on the `git gc` (for the reported
> situation) was in:
>
> # In builtin/pack-objects.c
> static void get_object_list(int ac, const char **av)
> {
> ...
> if (unpack_unreachable)
> loosen_unused_packed_objects();
> ...
> }
Yeah, good find.
This is my first time looking at the repacking strategy for partial
clones. It looks like we run an initial pack-objects to cover all the
promisor objects, and then do the "real" repack for everything else,
with "--exclude-promisor-objects".
The purpose of loosen_unused_packed_objects() is to catch any objects
that will be lost when our caller deletes all of the packs. But in this
case, those promisor objects are in a pack which won't be deleted, so
they should not be included.
> Another interesting thing is, the loosen_unused_packed_objects()
> function is being called twice because the function loads all packs
> files, via get_all_packs(), which will return the .temp-*pack file that
> is created by the `git pack-objects` child process from `git gc`:
>
> git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...
Yes, this is the "promisor" pack created by git-repack. It seems like
git-repack should tell pack-objects about the new pack with --keep-pack,
so that we know it is not going to be deleted.
That would also solve the rest of the problem, I _think_. In your
suggestion here:
> I'm not entirely sure about this (not this late in the day), but it seems to
> me that we should simply skip the "missing" (promisor) files when
> operating on a partial clone.
>
> Perhaps something like:
>
> --- >8 ---
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 525c2d8552..fedf58323d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
> {
> if (!unpack_unreachable_expiration)
> return 0;
> + if (exclude_promisor_objects && is_promisor_object(oid))
> + return 1;
> if (mtime > unpack_unreachable_expiration)
> return 0;
> if (oid_array_lookup(&recent_objects, oid) >= 0)
> --- >8 ---
you are avoiding writing out the file. But we should realize much
earlier that it is not something we need to even consider loosening.
In the loop in loosen_unused_packed_objects(), we skip packs that are
marked as "keep", so we'd skip the new promisor pack entirely. But we'd
still see all these objects in the _old_ promisor pack. However, for
each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
would likewise realize that each object is already being kept in the
other pack.
Something like this seems to work, but I only lightly tested it, and it
could probably use some refactoring to make it less horrible:
diff --git a/builtin/repack.c b/builtin/repack.c
index fdee8e4578..457525953a 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
repack_promisor_objects(&po_args, &names);
if (existing_packs.nr && delete_redundant) {
+ /*
+ * tell pack-objects about our new promisor pack, which
+ * we will also be keeping
+ */
+ for_each_string_list_item(item, &names) {
+ /*
+ * yuck, we seem to only have the name with the
+ * packdir prefixed
+ */
+ const char *prefix;
+ if (!skip_prefix(packtmp, packdir, &prefix) ||
+ *prefix++ != '/')
+ BUG("confused by packtmp");
+ strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+ prefix, item->string);
+ }
+
if (unpack_unreachable) {
strvec_pushf(&cmd.args,
"--unpack-unreachable=%s",
Do you want to try to work with that?
> A quick benchmark did show some promising result:
>
> # built from: 2e36527f23 (The sixth batch, 2021-04-02)
> Benchmark #1: ./bin-wrappers/git -C git.git gc
> Time (mean ± σ): 135.669 s ± 0.665 s [User: 42.789 s, System: 91.332 s]
> Range (min … max): 134.905 s … 136.115 s 3 runs
>
> # built from: 2e36527f23 + minor patch (from above)
> Benchmark #2: ./bin-wrappers/git -C git.git gc
> Time (mean ± σ): 12.586 s ± 0.031 s [User: 11.462 s, System: 1.365 s]
> Range (min … max): 12.553 s … 12.616 s 3 runs
>
> Summary:
> 'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'
It's still quite a bit slower than a non-partial clone because the
traversal with --exclude-promisor-objects is slow. I think that's
because it has to open up all of the objects in the promisor pack to see
what they refer to. I don't know if we can do better (and it's largely
an orthogonal problem to what you're solving here, so it probably makes
sense to just punt on it for now).
-Peff
next prev parent reply other threads:[~2021-04-07 21:17 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-03 9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-05 1:02 ` Rafael Silva
2021-04-07 21:17 ` Jeff King [this message]
2021-04-08 0:02 ` Jonathan Tan
2021-04-08 0:35 ` Jeff King
2021-04-12 7:09 ` Rafael Silva
2021-04-12 21:36 ` SZEDER Gábor
2021-04-12 21:49 ` Bryan Turner
2021-04-12 23:51 ` Jeff King
2021-04-12 23:47 ` Jeff King
2021-04-13 7:12 ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
2021-04-13 7:15 ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
2021-04-13 20:17 ` Junio C Hamano
2021-04-14 5:18 ` Jeff King
2021-04-13 7:16 ` [PATCH 2/3] lookup_unknown_object(): take a repository argument Jeff King
2021-04-13 7:17 ` [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects Jeff King
2021-04-13 20:22 ` Junio C Hamano
2021-04-13 18:10 ` [PATCH 0/3] low-hanging performance fruit with promisor packs SZEDER Gábor
2021-04-14 17:14 ` Jonathan Tan
2021-04-14 19:22 ` Rafael Silva
2021-04-13 18:05 ` rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-14 5:14 ` Jeff King
2021-04-11 10:59 ` SZEDER Gábor
2021-04-12 7:53 ` Rafael Silva
2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
2021-04-14 19:14 ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
2021-04-14 23:50 ` Jonathan Tan
2021-04-18 14:15 ` Rafael Silva
2021-04-14 19:14 ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
2021-04-15 1:04 ` Jonathan Tan
2021-04-15 3:51 ` Junio C Hamano
2021-04-15 9:03 ` Jeff King
2021-04-15 9:05 ` Jeff King
2021-04-18 7:12 ` Rafael Silva
2021-04-15 18:06 ` Junio C Hamano
2021-04-18 8:40 ` Rafael Silva
2021-04-14 22:10 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Junio C Hamano
2021-04-15 9:15 ` Jeff King
2021-04-18 8:20 ` Rafael Silva
2021-04-18 13:57 ` [PATCH v2 0/1] " Rafael Silva
2021-04-18 13:57 ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
2021-04-19 19:15 ` Jonathan Tan
2021-04-21 18:54 ` Rafael Silva
2021-04-19 23:09 ` Junio C Hamano
2021-04-21 19:25 ` Rafael Silva
2021-04-21 19:32 ` [PATCH v3] " Rafael Silva
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YG4hfge2y/AmcklZ@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=jonathantanmy@google.com \
--cc=rafaeloliveira.cs@gmail.com \
--cc=szeder.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).