git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Rafael Silva <rafaeloliveira.cs@gmail.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: rather slow 'git repack' in 'blob:none' partial clones
Date: Mon, 05 Apr 2021 03:02:33 +0200	[thread overview]
Message-ID: <gohp6ko8et3jdm.fsf@cpm12071.fritz.box> (raw)
In-Reply-To: <20210403090412.GH2271@szeder.dev>


SZEDER Gábor <szeder.dev@gmail.com> writes:

> Hi,
>
> here are trace timings of running 'git gc' in a "normal" and in a
> 'blob:none' partial clone:
>
>   $ git clone --bare https://github.com/git/git git-full.git
>   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
>   10:35:24.007277 trace.c:487             performance: 0.001550225 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>   10:35:24.044641 trace.c:487             performance: 0.035631270 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>   10:35:24.061070 read-cache.c:2315       performance: 0.000008506 s:  read cache ./index
>   Enumerating objects: 305283, done.
>   Counting objects: 100% (305283/305283), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75016/75016), done.
>   Writing objects: 100% (305283/305283), done.
>   Total 305283 (delta 227928), reused 305283 (delta 227928), pack-reused 0
>   10:35:32.604546 trace.c:487             performance: 8.555651283 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946975-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --unpack-unreachable=2.weeks.ago
>   10:35:32.680597 trace.c:487             performance: 8.633068356 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>   10:35:32.683130 trace.c:487             performance: 0.000959377 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago
>   10:35:32.684401 trace.c:487             performance: 0.000180173 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>   10:35:32.685730 trace.c:487             performance: 0.000263898 s: git command: /usr/local/libexec/git-core/git rerere gc
>   10:35:33.514816 trace.c:487             performance: 9.511597988 s: git command: git -C git-full.git/ gc
>   elapsed: 0:09.51  max RSS: 358964k
>
>   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
>   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
>   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>   Enumerating objects: 188205, done.
>   Counting objects: 100% (188205/188205), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (66520/66520), done.
>   Writing objects: 100% (188205/188205), done.
>   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
>   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
>   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
>   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
>   Removing duplicate objects: 100% (256/256), done.
>   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
>   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
>   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
>   elapsed: 1:22.54  max RSS: 1891832k
>
> Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
> increase.  What's going on here?
>
> Also note that that second 'git pack-objects' invocation doesn't show
> any progress for ~75s.
>
> FWIW, doing the same in a 'tree:0' partial clone is fast.

I'm not expert on the area - by "area": the entire git code base :).
However, I was intrigued by this performance numbers and decided to give
it a try on the investigation, mostly for learning. While I'm not sure
about the solution of the problem, I decided to share it here with the
hope that at least I'll be saving someone else time.

When I was digging into the code and adding trace2_region_*() calls, I
notice most of the time spent on the `git gc` (for the reported
situation) was in:

       # In builtin/pack-objects.c
       static void get_object_list(int ac, const char **av)
       {
               ...
               if (unpack_unreachable)
                       loosen_unused_packed_objects();
               ...
       }

The loosen_unused_packed_objects() will unpack unreachable objects as
loose objects, and given that the partial cloned .pack file is
incomplete, this result in writing a lot of loose objects in $GIT_DIR
increasing the execution time and memory consumption. This can be seen
by watching the $GIT_DIR/objects/ during the `git gc` execution on the
partial cloned repo.  On the fully clone repository all the objects
exist, at least on the fresh clone like in your report thus no object is
loose from the .pack file.

To provide some insight in the magnitude of the written loose objects,
I counted the number of objects that was being feed into
force_object_loose() with the following patch:

-- >8 --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..f912b54a5f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3478,7 +3478,7 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
 static void loosen_unused_packed_objects(void)
 {
 	struct packed_git *p;
-	uint32_t i;
+	uint32_t i, loosen_obj_counter = 0;
 	struct object_id oid;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
@@ -3492,10 +3492,13 @@ static void loosen_unused_packed_objects(void)
 			nth_packed_object_id(&oid, p, i);
 			if (!packlist_find(&to_pack, &oid) &&
 			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
-			    !loosened_object_can_be_discarded(&oid, p->mtime))
+			    !loosened_object_can_be_discarded(&oid, p->mtime) &&
+			    ++loosen_obj_counter)
 				if (force_object_loose(&oid, p->mtime))
 					die(_("unable to force loose object"));
 		}
+		fprintf(stderr, "loosen_unused_packed_objects() total of objects: %d\n", p->num_objects);
+		fprintf(stderr, "loosen_unused_packed_objects() objects that is being loosed: %d\n", loosen_obj_counter);
 	}
 }
 
-- >8 --

Running on a fresh and fully cloned git.git repo, there are (obviously)
0 unreachable objects that are being written as loose objects:

    $ bin-wrappers/git clone --bare https://github.com/git/git git-full.git
    $ time --format='elapsed: %E | max RSS: %Mk' bin-wrappers/git -C git-full.git gc
    loosen_unused_packed_objects() total of objects: 305292
    loosen_unused_packed_objects() objects that is being loosed: 0
    Enumerating objects: 305292, done.
    Counting objects: 100% (305292/305292), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (75035/75035), done.
    Writing objects: 100% (305292/305292), done.
    Selecting bitmap commits: 63006, done.
    Building bitmaps: 100% (317/317), done.
    Total 305292 (delta 227918), reused 305292 (delta 227918), pack-reused 0
    elapsed: 0:09.23 | max RSS: 438628k

On the other hand, when running on a fresh and partial cloned repo, we
can see that all the objects (at least according to my findings) are
being move out of .pack file into a loose object.

    $ bin-wrappers/git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
    $ time --format='elapsed: %E | max RSS: %Mk' bin-wrappers/git -C git-partial.git gc
    Enumerating objects: 188213, done.
    Counting objects: 100% (188213/188213), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (66524/66524), done.
    Writing objects: 100% (188213/188213), done.
    Total 188213 (delta 119971), reused 188213 (delta 119971), pack-reused 0
    loosen_unused_packed_objects() total of objects: 188213
    loosen_unused_packed_objects() objects that is being loosed: 188213
    loosen_unused_packed_objects() total of objects: 188213
    loosen_unused_packed_objects() objects that is being loosed: 376426
    Removing duplicate objects: 100% (256/256), done.
    elapsed: 3:24.86 | max RSS: 2085552k

Another interesting thing is, the loosen_unused_packed_objects()
function is being called twice because the function loads all packs
files, via get_all_packs(), which will return the .temp-*pack file that
is created by the `git pack-objects` child process from `git gc`:

    git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...

For this specific case, this make the situation worse as we end up
processing the object from both packfiles.  Also, the second
execution, which operates on the normal .pack file, is processing 2x
"objects" (in terms of counting the execution of
force_object_loose()), I couldn't quite figure out why.

I'm not entirely sure about this (not this late in the day), but it seems to
me that we should simply skip the "missing" (promisor) files when
operating on a partial clone.

Perhaps something like:

--- >8 ---
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..fedf58323d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
 {
        if (!unpack_unreachable_expiration)
                return 0;
+       if (exclude_promisor_objects && is_promisor_object(oid))
+               return 1;
        if (mtime > unpack_unreachable_expiration)
                return 0;
        if (oid_array_lookup(&recent_objects, oid) >= 0)
--- >8 ---

I'll try to prepare a patch for this change with proper testing, if this
turns out to be proper way to handle partial clone repository.

A quick benchmark did show some promising result:

    # built from: 2e36527f23 (The sixth batch, 2021-04-02)
    Benchmark #1: ./bin-wrappers/git -C git.git gc
          Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
          Range (min … max):   134.905 s … 136.115 s    3 runs

    # built from: 2e36527f23 + minor patch (from above)
    Benchmark #2: ./bin-wrappers/git -C git.git gc
          Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
          Range (min … max):   12.553 s … 12.616 s    3 runs

    Summary:
          'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'


-- 
Thanks
Rafael

  reply	other threads:[~2021-04-05  1:02 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-03  9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-05  1:02 ` Rafael Silva [this message]
2021-04-07 21:17   ` Jeff King
2021-04-08  0:02     ` Jonathan Tan
2021-04-08  0:35       ` Jeff King
2021-04-12  7:09     ` Rafael Silva
2021-04-12 21:36     ` SZEDER Gábor
2021-04-12 21:49       ` Bryan Turner
2021-04-12 23:51         ` Jeff King
2021-04-12 23:47       ` Jeff King
2021-04-13  7:12         ` [PATCH 0/3] low-hanging performance fruit with promisor packs Jeff King
2021-04-13  7:15           ` [PATCH 1/3] is_promisor_object(): free tree buffer after parsing Jeff King
2021-04-13 20:17             ` Junio C Hamano
2021-04-14  5:18               ` Jeff King
2021-04-13  7:16           ` [PATCH 2/3] lookup_unknown_object(): take a repository argument Jeff King
2021-04-13  7:17           ` [PATCH 3/3] revision: avoid parsing with --exclude-promisor-objects Jeff King
2021-04-13 20:22             ` Junio C Hamano
2021-04-13 18:10           ` [PATCH 0/3] low-hanging performance fruit with promisor packs SZEDER Gábor
2021-04-14 17:14           ` Jonathan Tan
2021-04-14 19:22           ` Rafael Silva
2021-04-13 18:05         ` rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-14  5:14           ` Jeff King
2021-04-11 10:59   ` SZEDER Gábor
2021-04-12  7:53     ` Rafael Silva
2021-04-14 19:14 ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Rafael Silva
2021-04-14 19:14   ` [PATCH 1/2] repack: teach --no-prune-packed to skip `git prune-packed` Rafael Silva
2021-04-14 23:50     ` Jonathan Tan
2021-04-18 14:15       ` Rafael Silva
2021-04-14 19:14   ` [PATCH 2/2] repack: avoid loosening promisor pack objects in partial clones Rafael Silva
2021-04-15  1:04     ` Jonathan Tan
2021-04-15  3:51       ` Junio C Hamano
2021-04-15  9:03         ` Jeff King
2021-04-15  9:05       ` Jeff King
2021-04-18  7:12       ` Rafael Silva
2021-04-15 18:06     ` Junio C Hamano
2021-04-18  8:40       ` Rafael Silva
2021-04-14 22:10   ` [PATCH 0/2] prevent `repack` to unpack and delete promisor objects Junio C Hamano
2021-04-15  9:15   ` Jeff King
2021-04-18  8:20     ` Rafael Silva
2021-04-18 13:57   ` [PATCH v2 0/1] " Rafael Silva
2021-04-18 13:57     ` [PATCH v2 1/1] repack: avoid loosening promisor objects in partial clones Rafael Silva
2021-04-19 19:15       ` Jonathan Tan
2021-04-21 18:54         ` Rafael Silva
2021-04-19 23:09       ` Junio C Hamano
2021-04-21 19:25         ` Rafael Silva
2021-04-21 19:32     ` [PATCH v3] " Rafael Silva

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=gohp6ko8et3jdm.fsf@cpm12071.fritz.box \
    --to=rafaeloliveira.cs@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).