git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* rather slow 'git repack' in 'blob:none' partial clones
@ 2021-04-03  9:04 SZEDER Gábor
  2021-04-05  1:02 ` Rafael Silva
  0 siblings, 1 reply; 12+ messages in thread
From: SZEDER Gábor @ 2021-04-03  9:04 UTC (permalink / raw)
  To: Git Mailing List

Hi,

here are trace timings of running 'git gc' in a "normal" and in a
'blob:none' partial clone:

  $ git clone --bare https://github.com/git/git git-full.git
  $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
  10:35:24.007277 trace.c:487             performance: 0.001550225 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
  10:35:24.044641 trace.c:487             performance: 0.035631270 s: git command: /usr/local/libexec/git-core/git reflog expire --all
  10:35:24.061070 read-cache.c:2315       performance: 0.000008506 s:  read cache ./index
  Enumerating objects: 305283, done.
  Counting objects: 100% (305283/305283), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (75016/75016), done.
  Writing objects: 100% (305283/305283), done.
  Total 305283 (delta 227928), reused 305283 (delta 227928), pack-reused 0
  10:35:32.604546 trace.c:487             performance: 8.555651283 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946975-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --unpack-unreachable=2.weeks.ago
  10:35:32.680597 trace.c:487             performance: 8.633068356 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
  10:35:32.683130 trace.c:487             performance: 0.000959377 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago
  10:35:32.684401 trace.c:487             performance: 0.000180173 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
  10:35:32.685730 trace.c:487             performance: 0.000263898 s: git command: /usr/local/libexec/git-core/git rerere gc
  10:35:33.514816 trace.c:487             performance: 9.511597988 s: git command: git -C git-full.git/ gc
  elapsed: 0:09.51  max RSS: 358964k

  $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
  $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
  10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
  10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
  Enumerating objects: 188205, done.
  Counting objects: 100% (188205/188205), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (66520/66520), done.
  Writing objects: 100% (188205/188205), done.
  Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
  10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
  10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
  10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
  Removing duplicate objects: 100% (256/256), done.
  10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
  10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
  10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
  10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
  10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
  elapsed: 1:22.54  max RSS: 1891832k

Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
increase.  What's going on here?

Also note that that second 'git pack-objects' invocation doesn't show
any progress for ~75s.

FWIW, doing the same in a 'tree:0' partial clone is fast.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-03  9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
@ 2021-04-05  1:02 ` Rafael Silva
  2021-04-07 21:17   ` Jeff King
  2021-04-11 10:59   ` SZEDER Gábor
  0 siblings, 2 replies; 12+ messages in thread
From: Rafael Silva @ 2021-04-05  1:02 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git Mailing List


SZEDER Gábor <szeder.dev@gmail.com> writes:

> Hi,
>
> here are trace timings of running 'git gc' in a "normal" and in a
> 'blob:none' partial clone:
>
>   $ git clone --bare https://github.com/git/git git-full.git
>   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
>   10:35:24.007277 trace.c:487             performance: 0.001550225 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>   10:35:24.044641 trace.c:487             performance: 0.035631270 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>   10:35:24.061070 read-cache.c:2315       performance: 0.000008506 s:  read cache ./index
>   Enumerating objects: 305283, done.
>   Counting objects: 100% (305283/305283), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75016/75016), done.
>   Writing objects: 100% (305283/305283), done.
>   Total 305283 (delta 227928), reused 305283 (delta 227928), pack-reused 0
>   10:35:32.604546 trace.c:487             performance: 8.555651283 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946975-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --unpack-unreachable=2.weeks.ago
>   10:35:32.680597 trace.c:487             performance: 8.633068356 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>   10:35:32.683130 trace.c:487             performance: 0.000959377 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago
>   10:35:32.684401 trace.c:487             performance: 0.000180173 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>   10:35:32.685730 trace.c:487             performance: 0.000263898 s: git command: /usr/local/libexec/git-core/git rerere gc
>   10:35:33.514816 trace.c:487             performance: 9.511597988 s: git command: git -C git-full.git/ gc
>   elapsed: 0:09.51  max RSS: 358964k
>
>   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
>   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
>   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>   Enumerating objects: 188205, done.
>   Counting objects: 100% (188205/188205), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (66520/66520), done.
>   Writing objects: 100% (188205/188205), done.
>   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
>   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
>   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
>   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
>   Removing duplicate objects: 100% (256/256), done.
>   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
>   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
>   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
>   elapsed: 1:22.54  max RSS: 1891832k
>
> Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
> increase.  What's going on here?
>
> Also note that that second 'git pack-objects' invocation doesn't show
> any progress for ~75s.
>
> FWIW, doing the same in a 'tree:0' partial clone is fast.

I'm not expert on the area - by "area": the entire git code base :).
However, I was intrigued by this performance numbers and decided to give
it a try on the investigation, mostly for learning. While I'm not sure
about the solution of the problem, I decided to share it here with the
hope that at least I'll be saving someone else time.

When I was digging into the code and adding trace2_region_*() calls, I
notice most of the time spent on the `git gc` (for the reported
situation) was in:

       # In builtin/pack-objects.c
       static void get_object_list(int ac, const char **av)
       {
               ...
               if (unpack_unreachable)
                       loosen_unused_packed_objects();
               ...
       }

The loosen_unused_packed_objects() will unpack unreachable objects as
loose objects, and given that the partial cloned .pack file is
incomplete, this result in writing a lot of loose objects in $GIT_DIR
increasing the execution time and memory consumption. This can be seen
by watching the $GIT_DIR/objects/ during the `git gc` execution on the
partial cloned repo.  On the fully clone repository all the objects
exist, at least on the fresh clone like in your report thus no object is
loose from the .pack file.

To provide some insight in the magnitude of the written loose objects,
I counted the number of objects that was being feed into
force_object_loose() with the following patch:

-- >8 --
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..f912b54a5f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3478,7 +3478,7 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
 static void loosen_unused_packed_objects(void)
 {
 	struct packed_git *p;
-	uint32_t i;
+	uint32_t i, loosen_obj_counter = 0;
 	struct object_id oid;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
@@ -3492,10 +3492,13 @@ static void loosen_unused_packed_objects(void)
 			nth_packed_object_id(&oid, p, i);
 			if (!packlist_find(&to_pack, &oid) &&
 			    !has_sha1_pack_kept_or_nonlocal(&oid) &&
-			    !loosened_object_can_be_discarded(&oid, p->mtime))
+			    !loosened_object_can_be_discarded(&oid, p->mtime) &&
+			    ++loosen_obj_counter)
 				if (force_object_loose(&oid, p->mtime))
 					die(_("unable to force loose object"));
 		}
+		fprintf(stderr, "loosen_unused_packed_objects() total of objects: %d\n", p->num_objects);
+		fprintf(stderr, "loosen_unused_packed_objects() objects that is being loosed: %d\n", loosen_obj_counter);
 	}
 }
 
-- >8 --

Running on a fresh and fully cloned git.git repo, there are (obviously)
0 unreachable objects that are being written as loose objects:

    $ bin-wrappers/git clone --bare https://github.com/git/git git-full.git
    $ time --format='elapsed: %E | max RSS: %Mk' bin-wrappers/git -C git-full.git gc
    loosen_unused_packed_objects() total of objects: 305292
    loosen_unused_packed_objects() objects that is being loosed: 0
    Enumerating objects: 305292, done.
    Counting objects: 100% (305292/305292), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (75035/75035), done.
    Writing objects: 100% (305292/305292), done.
    Selecting bitmap commits: 63006, done.
    Building bitmaps: 100% (317/317), done.
    Total 305292 (delta 227918), reused 305292 (delta 227918), pack-reused 0
    elapsed: 0:09.23 | max RSS: 438628k

On the other hand, when running on a fresh and partial cloned repo, we
can see that all the objects (at least according to my findings) are
being move out of .pack file into a loose object.

    $ bin-wrappers/git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
    $ time --format='elapsed: %E | max RSS: %Mk' bin-wrappers/git -C git-partial.git gc
    Enumerating objects: 188213, done.
    Counting objects: 100% (188213/188213), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (66524/66524), done.
    Writing objects: 100% (188213/188213), done.
    Total 188213 (delta 119971), reused 188213 (delta 119971), pack-reused 0
    loosen_unused_packed_objects() total of objects: 188213
    loosen_unused_packed_objects() objects that is being loosed: 188213
    loosen_unused_packed_objects() total of objects: 188213
    loosen_unused_packed_objects() objects that is being loosed: 376426
    Removing duplicate objects: 100% (256/256), done.
    elapsed: 3:24.86 | max RSS: 2085552k

Another interesting thing is, the loosen_unused_packed_objects()
function is being called twice because the function loads all packs
files, via get_all_packs(), which will return the .temp-*pack file that
is created by the `git pack-objects` child process from `git gc`:

    git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...

For this specific case, this make the situation worse as we end up
processing the object from both packfiles.  Also, the second
execution, which operates on the normal .pack file, is processing 2x
"objects" (in terms of counting the execution of
force_object_loose()), I couldn't quite figure out why.

I'm not entirely sure about this (not this late in the day), but it seems to
me that we should simply skip the "missing" (promisor) files when
operating on a partial clone.

Perhaps something like:

--- >8 ---
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 525c2d8552..fedf58323d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
 {
        if (!unpack_unreachable_expiration)
                return 0;
+       if (exclude_promisor_objects && is_promisor_object(oid))
+               return 1;
        if (mtime > unpack_unreachable_expiration)
                return 0;
        if (oid_array_lookup(&recent_objects, oid) >= 0)
--- >8 ---

I'll try to prepare a patch for this change with proper testing, if this
turns out to be proper way to handle partial clone repository.

A quick benchmark did show some promising result:

    # built from: 2e36527f23 (The sixth batch, 2021-04-02)
    Benchmark #1: ./bin-wrappers/git -C git.git gc
          Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
          Range (min … max):   134.905 s … 136.115 s    3 runs

    # built from: 2e36527f23 + minor patch (from above)
    Benchmark #2: ./bin-wrappers/git -C git.git gc
          Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
          Range (min … max):   12.553 s … 12.616 s    3 runs

    Summary:
          'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'


-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-05  1:02 ` Rafael Silva
@ 2021-04-07 21:17   ` Jeff King
  2021-04-08  0:02     ` Jonathan Tan
                       ` (2 more replies)
  2021-04-11 10:59   ` SZEDER Gábor
  1 sibling, 3 replies; 12+ messages in thread
From: Jeff King @ 2021-04-07 21:17 UTC (permalink / raw)
  To: Rafael Silva; +Cc: Jonathan Tan, SZEDER Gábor, Git Mailing List

On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:

> When I was digging into the code and adding trace2_region_*() calls, I
> notice most of the time spent on the `git gc` (for the reported
> situation) was in:
> 
>        # In builtin/pack-objects.c
>        static void get_object_list(int ac, const char **av)
>        {
>                ...
>                if (unpack_unreachable)
>                        loosen_unused_packed_objects();
>                ...
>        }

Yeah, good find.

This is my first time looking at the repacking strategy for partial
clones. It looks like we run an initial pack-objects to cover all the
promisor objects, and then do the "real" repack for everything else,
with "--exclude-promisor-objects".

The purpose of loosen_unused_packed_objects() is to catch any objects
that will be lost when our caller deletes all of the packs. But in this
case, those promisor objects are in a pack which won't be deleted, so
they should not be included.

> Another interesting thing is, the loosen_unused_packed_objects()
> function is being called twice because the function loads all packs
> files, via get_all_packs(), which will return the .temp-*pack file that
> is created by the `git pack-objects` child process from `git gc`:
> 
>     git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...

Yes, this is the "promisor" pack created by git-repack. It seems like
git-repack should tell pack-objects about the new pack with --keep-pack,
so that we know it is not going to be deleted.

That would also solve the rest of the problem, I _think_. In your
suggestion here:

> I'm not entirely sure about this (not this late in the day), but it seems to
> me that we should simply skip the "missing" (promisor) files when
> operating on a partial clone.
> 
> Perhaps something like:
> 
> --- >8 ---
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 525c2d8552..fedf58323d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>  {
>         if (!unpack_unreachable_expiration)
>                 return 0;
> +       if (exclude_promisor_objects && is_promisor_object(oid))
> +               return 1;
>         if (mtime > unpack_unreachable_expiration)
>                 return 0;
>         if (oid_array_lookup(&recent_objects, oid) >= 0)
> --- >8 ---

you are avoiding writing out the file. But we should realize much
earlier that it is not something we need to even consider loosening.

In the loop in loosen_unused_packed_objects(), we skip packs that are
marked as "keep", so we'd skip the new promisor pack entirely. But we'd
still see all these objects in the _old_ promisor pack. However, for
each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
would likewise realize that each object is already being kept in the
other pack.

Something like this seems to work, but I only lightly tested it, and it
could probably use some refactoring to make it less horrible:

diff --git a/builtin/repack.c b/builtin/repack.c
index fdee8e4578..457525953a 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			/*
+			 * tell pack-objects about our new promisor pack, which
+			 * we will also be keeping
+			 */
+			for_each_string_list_item(item, &names) {
+				/*
+				 * yuck, we seem to only have the name with the
+				 * packdir prefixed
+				 */
+				const char *prefix;
+				if (!skip_prefix(packtmp, packdir, &prefix) ||
+				    *prefix++ != '/')
+					BUG("confused by packtmp");
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     prefix, item->string);
+			}
+
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",

Do you want to try to work with that?

> A quick benchmark did show some promising result:
> 
>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>           Range (min … max):   134.905 s … 136.115 s    3 runs
> 
>     # built from: 2e36527f23 + minor patch (from above)
>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>           Range (min … max):   12.553 s … 12.616 s    3 runs
> 
>     Summary:
>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'

It's still quite a bit slower than a non-partial clone because the
traversal with --exclude-promisor-objects is slow. I think that's
because it has to open up all of the objects in the promisor pack to see
what they refer to. I don't know if we can do better (and it's largely
an orthogonal problem to what you're solving here, so it probably makes
sense to just punt on it for now).

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-07 21:17   ` Jeff King
@ 2021-04-08  0:02     ` Jonathan Tan
  2021-04-08  0:35       ` Jeff King
  2021-04-12  7:09     ` Rafael Silva
  2021-04-12 21:36     ` SZEDER Gábor
  2 siblings, 1 reply; 12+ messages in thread
From: Jonathan Tan @ 2021-04-08  0:02 UTC (permalink / raw)
  To: peff; +Cc: rafaeloliveira.cs, jonathantanmy, szeder.dev, git

> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> 
> > When I was digging into the code and adding trace2_region_*() calls, I
> > notice most of the time spent on the `git gc` (for the reported
> > situation) was in:
> > 
> >        # In builtin/pack-objects.c
> >        static void get_object_list(int ac, const char **av)
> >        {
> >                ...
> >                if (unpack_unreachable)
> >                        loosen_unused_packed_objects();
> >                ...
> >        }
> 
> Yeah, good find.

Agreed!

> This is my first time looking at the repacking strategy for partial
> clones. It looks like we run an initial pack-objects to cover all the
> promisor objects, and then do the "real" repack for everything else,
> with "--exclude-promisor-objects".

That is correct - we need two separate packs because one of them needs
to be accompanied by a separate ".promisor" file to show that those
objects are from the promisor remote.

> The purpose of loosen_unused_packed_objects() is to catch any objects
> that will be lost when our caller deletes all of the packs. But in this
> case, those promisor objects are in a pack which won't be deleted, so
> they should not be included.

Makes sense.

> > I'm not entirely sure about this (not this late in the day), but it seems to
> > me that we should simply skip the "missing" (promisor) files when
> > operating on a partial clone.
> > 
> > Perhaps something like:
> > 
> > --- >8 ---
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 525c2d8552..fedf58323d 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
> >  {
> >         if (!unpack_unreachable_expiration)
> >                 return 0;
> > +       if (exclude_promisor_objects && is_promisor_object(oid))
> > +               return 1;
> >         if (mtime > unpack_unreachable_expiration)
> >                 return 0;
> >         if (oid_array_lookup(&recent_objects, oid) >= 0)
> > --- >8 ---

I think this will work. The only thing we might need to watch out for is
that if we repack with --exclude-promisor-objects, some might say that
every excluded object should be loosened. I don't think that's true in
this case, because (1) the object is not unreachable and the argument
talks about loosening unreachable objects, (2) promisor objects can be
re-obtained from the promisor remote anyway. So I think we're fine.

But I think Peff suggested something better below.

> In the loop in loosen_unused_packed_objects(), we skip packs that are
> marked as "keep", so we'd skip the new promisor pack entirely. But we'd
> still see all these objects in the _old_ promisor pack. However, for
> each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
> would likewise realize that each object is already being kept in the
> other pack.
> 
> Something like this seems to work, but I only lightly tested it, and it
> could probably use some refactoring to make it less horrible:
> 
> diff --git a/builtin/repack.c b/builtin/repack.c
> index fdee8e4578..457525953a 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			/*
> +			 * tell pack-objects about our new promisor pack, which
> +			 * we will also be keeping
> +			 */
> +			for_each_string_list_item(item, &names) {
> +				/*
> +				 * yuck, we seem to only have the name with the
> +				 * packdir prefixed
> +				 */
> +				const char *prefix;
> +				if (!skip_prefix(packtmp, packdir, &prefix) ||
> +				    *prefix++ != '/')
> +					BUG("confused by packtmp");
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     prefix, item->string);
> +			}
> +
>  			if (unpack_unreachable) {
>  				strvec_pushf(&cmd.args,
>  					     "--unpack-unreachable=%s",
> 
> Do you want to try to work with that?

It seems to me that this would work. The only part I was confused about
is "packtmp", but that is just the pack directory plus a specific
prefix, so "pack-objects" will indeed see that packfile as being part of
the repository - no problem.

repack_promisor_objects() might be able to be refactored to provide the
names in a format that we want, but looking at it, I don't think it's
possible (it just uses "packtmp", so we have the same "packtmp"
problem).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-08  0:02     ` Jonathan Tan
@ 2021-04-08  0:35       ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2021-04-08  0:35 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: rafaeloliveira.cs, szeder.dev, git

On Wed, Apr 07, 2021 at 05:02:41PM -0700, Jonathan Tan wrote:

> It seems to me that this would work. The only part I was confused about
> is "packtmp", but that is just the pack directory plus a specific
> prefix, so "pack-objects" will indeed see that packfile as being part of
> the repository - no problem.
> 
> repack_promisor_objects() might be able to be refactored to provide the
> names in a format that we want, but looking at it, I don't think it's
> possible (it just uses "packtmp", so we have the same "packtmp"
> problem).

Yeah, that was the refactoring I alluded to. I think the earlier code
should keep the ".tmp-%d" portion in a separate string, and then
construct packtmp from that. And then we don't have to try to recover it
from the concatenated string.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-05  1:02 ` Rafael Silva
  2021-04-07 21:17   ` Jeff King
@ 2021-04-11 10:59   ` SZEDER Gábor
  2021-04-12  7:53     ` Rafael Silva
  1 sibling, 1 reply; 12+ messages in thread
From: SZEDER Gábor @ 2021-04-11 10:59 UTC (permalink / raw)
  To: Rafael Silva; +Cc: Git Mailing List

On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> > here are trace timings of running 'git gc' in a "normal" and in a
> > 'blob:none' partial clone:
> >
> >   $ git clone --bare https://github.com/git/git git-full.git
> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
[...]
> >   elapsed: 0:09.51  max RSS: 358964k
> >
> >   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
> >   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
> >   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
> >   Enumerating objects: 188205, done.
> >   Counting objects: 100% (188205/188205), done.
> >   Delta compression using up to 4 threads
> >   Compressing objects: 100% (66520/66520), done.
> >   Writing objects: 100% (188205/188205), done.
> >   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
> >   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
> >   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
> >   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
> >   Removing duplicate objects: 100% (256/256), done.
> >   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
> >   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
> >   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
> >   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
> >   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
> >   elapsed: 1:22.54  max RSS: 1891832k
> >
> > Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
> > increase.  What's going on here?
> >
> > Also note that that second 'git pack-objects' invocation doesn't show
> > any progress for ~75s.
> >
> > FWIW, doing the same in a 'tree:0' partial clone is fast.
> 
> I'm not expert on the area - by "area": the entire git code base :).
> However, I was intrigued by this performance numbers and decided to give
> it a try on the investigation, mostly for learning.

That's the spirit!

> While I'm not sure
> about the solution of the problem, I decided to share it here with the
> hope that at least I'll be saving someone else time.
> 
> When I was digging into the code and adding trace2_region_*() calls, I
> notice most of the time spent on the `git gc` (for the reported
> situation) was in:
> 
>        # In builtin/pack-objects.c
>        static void get_object_list(int ac, const char **av)
>        {
>                ...
>                if (unpack_unreachable)
>                        loosen_unused_packed_objects();
>                ...
>        }
> 
> The loosen_unused_packed_objects() will unpack unreachable objects as
> loose objects, and given that the partial cloned .pack file is
> incomplete, this result in writing a lot of loose objects in $GIT_DIR
> increasing the execution time and memory consumption. This can be seen
> by watching the $GIT_DIR/objects/ during the `git gc` execution on the
> partial cloned repo.

Indeed, that 'blob:none' partial clone grew in size temporarily to
over 1.3GB during repacking, but luckily all those unnecessarily
loosened objects were removed at the end.  I first noticed this issue
while attempting to repack a considerably larger partial-cloned
repository, which I aborted because it ate up all the memory...  I
suppose that even if it didn't use that much memory, it would
eventually run out of available disk space for all those loose objects
anyway...


> I'm not entirely sure about this (not this late in the day), but it seems to
> me that we should simply skip the "missing" (promisor) files when
> operating on a partial clone.
> 
> Perhaps something like:
> 
> --- >8 ---
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 525c2d8552..fedf58323d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>  {
>         if (!unpack_unreachable_expiration)
>                 return 0;
> +       if (exclude_promisor_objects && is_promisor_object(oid))
> +               return 1;
>         if (mtime > unpack_unreachable_expiration)
>                 return 0;
>         if (oid_array_lookup(&recent_objects, oid) >= 0)
> --- >8 ---
> 
> I'll try to prepare a patch for this change with proper testing, if this
> turns out to be proper way to handle partial clone repository.
> 
> A quick benchmark did show some promising result:
> 
>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>           Range (min … max):   134.905 s … 136.115 s    3 runs
> 
>     # built from: 2e36527f23 + minor patch (from above)
>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>           Range (min … max):   12.553 s … 12.616 s    3 runs
> 
>     Summary:
>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'

I can confirm that you change speeds up repacking considerably and it
completely eliminates that temporary repo size explision due to
unpacked objects, but, alas, it doesn't seem to reduce the memory
usage.

Thanks,
Gábor


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-07 21:17   ` Jeff King
  2021-04-08  0:02     ` Jonathan Tan
@ 2021-04-12  7:09     ` Rafael Silva
  2021-04-12 21:36     ` SZEDER Gábor
  2 siblings, 0 replies; 12+ messages in thread
From: Rafael Silva @ 2021-04-12  7:09 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, SZEDER Gábor, Git Mailing List


Jeff King <peff@peff.net> writes:

> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
>
>> I'm not entirely sure about this (not this late in the day), but it seems to
>> me that we should simply skip the "missing" (promisor) files when
>> operating on a partial clone.
>> 
>> Perhaps something like:
>> 
>> --- >8 ---
>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> index 525c2d8552..fedf58323d 100644
>> --- a/builtin/pack-objects.c
>> +++ b/builtin/pack-objects.c
>> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>>  {
>>         if (!unpack_unreachable_expiration)
>>                 return 0;
>> +       if (exclude_promisor_objects && is_promisor_object(oid))
>> +               return 1;
>>         if (mtime > unpack_unreachable_expiration)
>>                 return 0;
>>         if (oid_array_lookup(&recent_objects, oid) >= 0)
>> --- >8 ---
>
> you are avoiding writing out the file. But we should realize much
> earlier that it is not something we need to even consider loosening.
>
> In the loop in loosen_unused_packed_objects(), we skip packs that are
> marked as "keep", so we'd skip the new promisor pack entirely. But we'd
> still see all these objects in the _old_ promisor pack. However, for
> each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
> would likewise realize that each object is already being kept in the
> other pack.
>

Agreed. Realizing sooner that we shouldn't even consider loosening the
objects from the packfile it's better solution.

> Something like this seems to work, but I only lightly tested it, and it
> could probably use some refactoring to make it less horrible:
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index fdee8e4578..457525953a 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			/*
> +			 * tell pack-objects about our new promisor pack, which
> +			 * we will also be keeping
> +			 */
> +			for_each_string_list_item(item, &names) {
> +				/*
> +				 * yuck, we seem to only have the name with the
> +				 * packdir prefixed
> +				 */
> +				const char *prefix;
> +				if (!skip_prefix(packtmp, packdir, &prefix) ||
> +				    *prefix++ != '/')
> +					BUG("confused by packtmp");
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     prefix, item->string);
> +			}
> +
>  			if (unpack_unreachable) {
>  				strvec_pushf(&cmd.args,
>  					     "--unpack-unreachable=%s",
>
> Do you want to try to work with that?
>

Yes, I'll try to work with that, together with refactoring that you
mentioned in the code and the other replies.

Thanks for the suggestion.

>> A quick benchmark did show some promising result:
>> 
>>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>>           Range (min … max):   134.905 s … 136.115 s    3 runs
>> 
>>     # built from: 2e36527f23 + minor patch (from above)
>>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>>           Range (min … max):   12.553 s … 12.616 s    3 runs
>> 
>>     Summary:
>>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'
>
> It's still quite a bit slower than a non-partial clone because the
> traversal with --exclude-promisor-objects is slow. I think that's
> because it has to open up all of the objects in the promisor pack to see
> what they refer to. I don't know if we can do better (and it's largely
> an orthogonal problem to what you're solving here, so it probably makes
> sense to just punt on it for now).
>
> -Peff

Make sense.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-11 10:59   ` SZEDER Gábor
@ 2021-04-12  7:53     ` Rafael Silva
  0 siblings, 0 replies; 12+ messages in thread
From: Rafael Silva @ 2021-04-12  7:53 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Git Mailing List


SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
>> > here are trace timings of running 'git gc' in a "normal" and in a
>> > 'blob:none' partial clone:
>> >
>> >   $ git clone --bare https://github.com/git/git git-full.git
>> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-full.git/ gc
> [...]
>> >   elapsed: 0:09.51  max RSS: 358964k
>> >
>> >   $ git clone --bare --filter=blob:none https://github.com/git/git git-partial.git
>> >   $ GIT_TRACE_PERFORMANCE=2 /usr/bin/time --format='elapsed: %E  max RSS: %Mk' git -C git-partial.git/ gc
>> >   10:35:47.637735 trace.c:487             performance: 0.000872539 s: git command: /usr/local/libexec/git-core/git pack-refs --all --prune
>> >   10:35:47.675498 trace.c:487             performance: 0.036246403 s: git command: /usr/local/libexec/git-core/git reflog expire --all
>> >   Enumerating objects: 188205, done.
>> >   Counting objects: 100% (188205/188205), done.
>> >   Delta compression using up to 4 threads
>> >   Compressing objects: 100% (66520/66520), done.
>> >   Writing objects: 100% (188205/188205), done.
>> >   Total 188205 (delta 119967), reused 188205 (delta 119967), pack-reused 0
>> >   10:35:50.081709 trace.c:487             performance: 2.402625839 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack
>> >   10:35:50.100131 read-cache.c:2315       performance: 0.000009979 s:  read cache ./index
>> >   10:37:04.973541 trace.c:487             performance: 74.885793630 s: git command: /usr/local/libexec/git-core/git pack-objects --local --delta-base-offset objects/pack/.tmp-2946990-pack --keep-true-parents --honor-pack-keep --non-empty --all --reflog --indexed-objects --exclude-promisor-objects --unpack-unreachable=2.weeks.ago
>> >   Removing duplicate objects: 100% (256/256), done.
>> >   10:37:07.482791 trace.c:487             performance: 79.804973525 s: git command: /usr/local/libexec/git-core/git repack -d -l -A --unpack-unreachable=2.weeks.ago
>> >   10:37:07.549333 trace.c:487             performance: 0.008025426 s: git command: /usr/local/libexec/git-core/git prune --expire 2.weeks.ago --exclude-promisor-objects
>> >   10:37:07.552499 trace.c:487             performance: 0.000362981 s: git command: /usr/local/libexec/git-core/git worktree prune --expire 3.months.ago
>> >   10:37:07.554521 trace.c:487             performance: 0.000273834 s: git command: /usr/local/libexec/git-core/git rerere gc
>> >   10:37:10.168233 trace.c:487             performance: 82.533331484 s: git command: git -C git-partial.git/ gc
>> >   elapsed: 1:22.54  max RSS: 1891832k
>> >
>> > Notice the ~9s vs. 82s runtime and ~350M vs. 1.9G memory consumption
>> > increase.  What's going on here?
>> >
>> > Also note that that second 'git pack-objects' invocation doesn't show
>> > any progress for ~75s.
>> >
>> > FWIW, doing the same in a 'tree:0' partial clone is fast.
>> 
>> I'm not expert on the area - by "area": the entire git code base :).
>> However, I was intrigued by this performance numbers and decided to give
>> it a try on the investigation, mostly for learning.
>
> That's the spirit!
>

:)

>> While I'm not sure
>> about the solution of the problem, I decided to share it here with the
>> hope that at least I'll be saving someone else time.
>> 
>> When I was digging into the code and adding trace2_region_*() calls, I
>> notice most of the time spent on the `git gc` (for the reported
>> situation) was in:
>> 
>>        # In builtin/pack-objects.c
>>        static void get_object_list(int ac, const char **av)
>>        {
>>                ...
>>                if (unpack_unreachable)
>>                        loosen_unused_packed_objects();
>>                ...
>>        }
>> 
>> The loosen_unused_packed_objects() will unpack unreachable objects as
>> loose objects, and given that the partial cloned .pack file is
>> incomplete, this result in writing a lot of loose objects in $GIT_DIR
>> increasing the execution time and memory consumption. This can be seen
>> by watching the $GIT_DIR/objects/ during the `git gc` execution on the
>> partial cloned repo.
>
> Indeed, that 'blob:none' partial clone grew in size temporarily to
> over 1.3GB during repacking, but luckily all those unnecessarily
> loosened objects were removed at the end.  I first noticed this issue
> while attempting to repack a considerably larger partial-cloned
> repository, which I aborted because it ate up all the memory...  I
> suppose that even if it didn't use that much memory, it would
> eventually run out of available disk space for all those loose objects
> anyway...
>
>
>> I'm not entirely sure about this (not this late in the day), but it seems to
>> me that we should simply skip the "missing" (promisor) files when
>> operating on a partial clone.
>> 
>> Perhaps something like:
>> 
>> --- >8 ---
>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> index 525c2d8552..fedf58323d 100644
>> --- a/builtin/pack-objects.c
>> +++ b/builtin/pack-objects.c
>> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>>  {
>>         if (!unpack_unreachable_expiration)
>>                 return 0;
>> +       if (exclude_promisor_objects && is_promisor_object(oid))
>> +               return 1;
>>         if (mtime > unpack_unreachable_expiration)
>>                 return 0;
>>         if (oid_array_lookup(&recent_objects, oid) >= 0)
>> --- >8 ---
>> 
>> I'll try to prepare a patch for this change with proper testing, if this
>> turns out to be proper way to handle partial clone repository.
>> 
>> A quick benchmark did show some promising result:
>> 
>>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>>           Range (min … max):   134.905 s … 136.115 s    3 runs
>> 
>>     # built from: 2e36527f23 + minor patch (from above)
>>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>>           Range (min … max):   12.553 s … 12.616 s    3 runs
>> 
>>     Summary:
>>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'
>
> I can confirm that you change speeds up repacking considerably and it
> completely eliminates that temporary repo size explision due to
> unpacked objects, but, alas, it doesn't seem to reduce the memory
> usage.
>
> Thanks,
> Gábor

Thanks for confirming it. The memory usage, indeed, is almost the same
after the changes.  Nevertheless, Peff suggested a better approach to
addressing this issue, by realizing earlier that we should even
consider loosening the objects from the pack.

I'll try to work with that approach and see how this apply to memory
usage.

-- 
Thanks
Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-07 21:17   ` Jeff King
  2021-04-08  0:02     ` Jonathan Tan
  2021-04-12  7:09     ` Rafael Silva
@ 2021-04-12 21:36     ` SZEDER Gábor
  2021-04-12 21:49       ` Bryan Turner
  2021-04-12 23:47       ` Jeff King
  2 siblings, 2 replies; 12+ messages in thread
From: SZEDER Gábor @ 2021-04-12 21:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Wed, Apr 07, 2021 at 05:17:50PM -0400, Jeff King wrote:
> On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:
> 
> > When I was digging into the code and adding trace2_region_*() calls, I
> > notice most of the time spent on the `git gc` (for the reported
> > situation) was in:
> > 
> >        # In builtin/pack-objects.c
> >        static void get_object_list(int ac, const char **av)
> >        {
> >                ...
> >                if (unpack_unreachable)
> >                        loosen_unused_packed_objects();
> >                ...
> >        }
> 
> Yeah, good find.
> 
> This is my first time looking at the repacking strategy for partial
> clones. It looks like we run an initial pack-objects to cover all the
> promisor objects, and then do the "real" repack for everything else,
> with "--exclude-promisor-objects".
> 
> The purpose of loosen_unused_packed_objects() is to catch any objects
> that will be lost when our caller deletes all of the packs. But in this
> case, those promisor objects are in a pack which won't be deleted, so
> they should not be included.
> 
> > Another interesting thing is, the loosen_unused_packed_objects()
> > function is being called twice because the function loads all packs
> > files, via get_all_packs(), which will return the .temp-*pack file that
> > is created by the `git pack-objects` child process from `git gc`:
> > 
> >     git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...
> 
> Yes, this is the "promisor" pack created by git-repack. It seems like
> git-repack should tell pack-objects about the new pack with --keep-pack,
> so that we know it is not going to be deleted.
> 
> That would also solve the rest of the problem, I _think_. In your
> suggestion here:
> 
> > I'm not entirely sure about this (not this late in the day), but it seems to
> > me that we should simply skip the "missing" (promisor) files when
> > operating on a partial clone.
> > 
> > Perhaps something like:
> > 
> > --- >8 ---
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 525c2d8552..fedf58323d 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
> >  {
> >         if (!unpack_unreachable_expiration)
> >                 return 0;
> > +       if (exclude_promisor_objects && is_promisor_object(oid))
> > +               return 1;
> >         if (mtime > unpack_unreachable_expiration)
> >                 return 0;
> >         if (oid_array_lookup(&recent_objects, oid) >= 0)
> > --- >8 ---
> 
> you are avoiding writing out the file. But we should realize much
> earlier that it is not something we need to even consider loosening.
> 
> In the loop in loosen_unused_packed_objects(), we skip packs that are
> marked as "keep", so we'd skip the new promisor pack entirely. But we'd
> still see all these objects in the _old_ promisor pack. However, for
> each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
> would likewise realize that each object is already being kept in the
> other pack.
> 
> Something like this seems to work, but I only lightly tested it, and it
> could probably use some refactoring to make it less horrible:
> 
> diff --git a/builtin/repack.c b/builtin/repack.c
> index fdee8e4578..457525953a 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		repack_promisor_objects(&po_args, &names);
>  
>  		if (existing_packs.nr && delete_redundant) {
> +			/*
> +			 * tell pack-objects about our new promisor pack, which
> +			 * we will also be keeping
> +			 */
> +			for_each_string_list_item(item, &names) {
> +				/*
> +				 * yuck, we seem to only have the name with the
> +				 * packdir prefixed
> +				 */
> +				const char *prefix;
> +				if (!skip_prefix(packtmp, packdir, &prefix) ||
> +				    *prefix++ != '/')
> +					BUG("confused by packtmp");
> +				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
> +					     prefix, item->string);
> +			}
> +
>  			if (unpack_unreachable) {
>  				strvec_pushf(&cmd.args,
>  					     "--unpack-unreachable=%s",
> 

All what you wrote above makes sense to me, but I've never looked at
how partial clones work, so it doesn't mean anything...  In any case
your patch brings great speedups, but, unfortunately, the memory usage
remains as high as it was:

  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-partial.git/ gc
  Enumerating objects: 188450, done.
  Counting objects: 100% (188450/188450), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (66623/66623), done.
  Writing objects: 100% (188450/188450), done.
  Total 188450 (delta 120109), reused 188450 (delta 120109), pack-reused 0
  elapsed: 0:15.18  max RSS: 1888332k

And git.git is not all that large, I wonder how much memory would be
necessary to 'gc' a 'blob:none' clone of e.g. chromium?! :)

BTW, this high memory usage in a partial clone is not specific to
'repack', 'fsck' suffers just as much:

  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-full.git/ fsck
  Checking object directories: 100% (256/256), done.
  warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
  Checking objects: 100% (305662/305662), done.
  Checking connectivity: 305662, done.
  Verifying commits in commit graph: 100% (65031/65031), done.
  elapsed: 0:33.99  max RSS: 281936k
  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-partial.git/ fsck
  Checking object directories: 100% (256/256), done.
  warning in tag d6602ec5194c87b0fc87103ca4d67251c76f233a: missingTaggerEntry: invalid format - expected 'tagger' line
  Checking objects: 100% (188450/188450), done.
  Verifying commits in commit graph: 100% (65031/65031), done.
  elapsed: 0:29.96  max RSS: 1877340k


And a somewhat related issue: when the server doesn't support filters,
then 'git clone --filter=...' prints a warning and proceeds to clone
the full repo.  Reading ba95710a3b ({fetch,upload}-pack: support
filter in protocol v2, 2018-05-03) this seems to be intentional and I
tend to think that it makes sense (though I managed to overlook that
warning twice today...  I surely wouldn't have overlooked a hard
error, but that would perhaps be too harsh in this case, dunno).
However, the resulting full clone is still marked as partial:

  $ git clone --bare --filter=blob:none https://git.kernel.org/pub/scm/git/git.git git-not-really-partial.git
  Cloning into bare repository 'git-not-really-partial.git'...
  warning: filtering not recognized by server, ignoring
  remote: Enumerating objects: 591, done.
  remote: Counting objects: 100% (591/591), done.
  remote: Compressing objects: 100% (293/293), done.
  remote: Total 305662 (delta 372), reused 393 (delta 298), pack-reused 305071
  Receiving objects: 100% (305662/305662), 96.83 MiB | 2.10 MiB/s, done.
  Resolving deltas: 100% (228123/228123), done.
  $ ls -l git-not-really-partial.git/objects/pack/
  total 107568
  -r--r--r-- 1 szeder szeder   8559608 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.idx
  -r--r--r-- 1 szeder szeder 101535430 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.pack
  -rw------- 1 szeder szeder     49012 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.promisor
  $ cat git-not-really-partial.git/config 
  [core]
  	repositoryformatversion = 1
  	filemode = true
  	bare = true
  [remote "origin"]
  	url = https://git.kernel.org/pub/scm/git/git.git
  	promisor = true
  	partialclonefilter = blob:none

I wonder whether this is intentional, or that it is really the desired
behavior, considering that 'gc/repack/fsck' still treat it as a
partial clone, and, consequently, are affected by this slowness and
much higher memory usage, and since the repo now contains a lot more
objects than expected (all the blobs as well), they are much slower:

  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-not-really-partial.git/ gc
  Enumerating objects: 305662, done.
  Counting objects: 100% (305662/305662), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (75200/75200), done.
  Writing objects: 100% (305662/305662), done.
  Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
  Removing duplicate objects: 100% (256/256), done.
  elapsed: 4:28.96  max RSS: 1985100k
  # with Peff's patch above:
  $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-not-really-partial.git/ gc
  Enumerating objects: 305662, done.
  Counting objects: 100% (305662/305662), done.
  Delta compression using up to 4 threads
  Compressing objects: 100% (75200/75200), done.
  Writing objects: 100% (305662/305662), done.
  Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
  elapsed: 1:21.83  max RSS: 1959740k


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 21:36     ` SZEDER Gábor
@ 2021-04-12 21:49       ` Bryan Turner
  2021-04-12 23:51         ` Jeff King
  2021-04-12 23:47       ` Jeff King
  1 sibling, 1 reply; 12+ messages in thread
From: Bryan Turner @ 2021-04-12 21:49 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Jeff King, Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 2:37 PM SZEDER Gábor <szeder.dev@gmail.com> wrote:
>
> And a somewhat related issue: when the server doesn't support filters,
> then 'git clone --filter=...' prints a warning and proceeds to clone
> the full repo.  Reading ba95710a3b ({fetch,upload}-pack: support
> filter in protocol v2, 2018-05-03) this seems to be intentional and I
> tend to think that it makes sense (though I managed to overlook that
> warning twice today...  I surely wouldn't have overlooked a hard
> error, but that would perhaps be too harsh in this case, dunno).
> However, the resulting full clone is still marked as partial:
>
>   $ git clone --bare --filter=blob:none https://git.kernel.org/pub/scm/git/git.git git-not-really-partial.git
>   Cloning into bare repository 'git-not-really-partial.git'...
>   warning: filtering not recognized by server, ignoring
>   remote: Enumerating objects: 591, done.
>   remote: Counting objects: 100% (591/591), done.
>   remote: Compressing objects: 100% (293/293), done.
>   remote: Total 305662 (delta 372), reused 393 (delta 298), pack-reused 305071
>   Receiving objects: 100% (305662/305662), 96.83 MiB | 2.10 MiB/s, done.
>   Resolving deltas: 100% (228123/228123), done.
>   $ ls -l git-not-really-partial.git/objects/pack/
>   total 107568
>   -r--r--r-- 1 szeder szeder   8559608 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.idx
>   -r--r--r-- 1 szeder szeder 101535430 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.pack
>   -rw------- 1 szeder szeder     49012 Apr 12 21:13 pack-53f3ee0dfeaa8cea65c78473cd5904bf5ddfaa20.promisor
>   $ cat git-not-really-partial.git/config
>   [core]
>         repositoryformatversion = 1
>         filemode = true
>         bare = true
>   [remote "origin"]
>         url = https://git.kernel.org/pub/scm/git/git.git
>         promisor = true
>         partialclonefilter = blob:none

I ran into this same surprising behavior recently, too. I was adding
some automated testing to Bitbucket for partial clones and initially
tried to use whether the repository was configured with a partial
clone filter as one of my checks, only to find that even when filters
weren't supported it was still set. The only way I could find to
detect that a partial clone that was requested didn't actually happen
was to parse the git clone output and look for the warning.

>
> I wonder whether this is intentional, or that it is really the desired
> behavior, considering that 'gc/repack/fsck' still treat it as a
> partial clone, and, consequently, are affected by this slowness and
> much higher memory usage, and since the repo now contains a lot more
> objects than expected (all the blobs as well), they are much slower:
>
>   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk git -C git-not-really-partial.git/ gc
>   Enumerating objects: 305662, done.
>   Counting objects: 100% (305662/305662), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75200/75200), done.
>   Writing objects: 100% (305662/305662), done.
>   Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
>   Removing duplicate objects: 100% (256/256), done.
>   elapsed: 4:28.96  max RSS: 1985100k
>   # with Peff's patch above:
>   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-not-really-partial.git/ gc
>   Enumerating objects: 305662, done.
>   Counting objects: 100% (305662/305662), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (75200/75200), done.
>   Writing objects: 100% (305662/305662), done.
>   Total 305662 (delta 228123), reused 305662 (delta 228123), pack-reused 0
>   elapsed: 1:21.83  max RSS: 1959740k
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 21:36     ` SZEDER Gábor
  2021-04-12 21:49       ` Bryan Turner
@ 2021-04-12 23:47       ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Jeff King @ 2021-04-12 23:47 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 11:36:53PM +0200, SZEDER Gábor wrote:

> All what you wrote above makes sense to me, but I've never looked at
> how partial clones work, so it doesn't mean anything...  In any case
> your patch brings great speedups, but, unfortunately, the memory usage
> remains as high as it was:
> 
>   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-partial.git/ gc
>   Enumerating objects: 188450, done.
>   Counting objects: 100% (188450/188450), done.
>   Delta compression using up to 4 threads
>   Compressing objects: 100% (66623/66623), done.
>   Writing objects: 100% (188450/188450), done.
>   Total 188450 (delta 120109), reused 188450 (delta 120109), pack-reused 0
>   elapsed: 0:15.18  max RSS: 1888332k
> 
> And git.git is not all that large, I wonder how much memory would be
> necessary to 'gc' a 'blob:none' clone of e.g. chromium?! :)
> 
> BTW, this high memory usage in a partial clone is not specific to
> 'repack', 'fsck' suffers just as much:

I think the issue is in the exclude-promisor-object code paths of the
traversal. Try this:

  [two clones of git.git, one full and one partial]
  $ git clone --no-local --bare /path/to/git full.git
  $ git clone --no-local --bare --filter=blob:none /path/to/git partial.git

  [full clone is quick to traverse all objects]
  $ time git -C full.git rev-list --count --all
  63215
  real	0m0.369s
  user	0m0.365s
  sys	0m0.004s

  [partial is, too; it's the same amount of work because we're just
   looking at the commits here]
  $ time git -C partial.git rev-list --count --all
  63215
  real	0m0.373s
  user	0m0.364s
  sys	0m0.009s

  [but now ask it to exclude promisor objects, and it's much slower;
  this is because is_promisor_object() opens up each tree in the pack in
  order to see which "promised" objects it mentions]
  $ time git -C partial.git rev-list --exclude-promisor-objects --count --all
  0
  real	0m11.723s
  user	0m11.354s
  sys	0m0.369s

And I think that is the source for the memory use, too. Without that
option, we peak at 11MB heap. With it, 1.6GB. Oops. Looks like there
might be some "leaks" when we parse tree objects and then leave the
buffers connected to the structs (technically not a leak because we
still have the pointers, but obviously having every tree in memory at
once is not good).

The patch below drops the peak heap to 165MB. Still quite a bit more,
but I think it's a combination of delta-base cache (96MB) plus extra
structs for all the non-commit objects whose flags we marked.

It does seem like there's probably a good space/time tradeoff to be made
here (e.g., caching the list of "promisor" object ids for the pack
instead of inflating and reading all of the trees on the fly).

diff --git a/packfile.c b/packfile.c
index 8668345d93..b79cbc8cd4 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
 			return 0;
 		while (tree_entry_gently(&desc, &entry))
 			oidset_insert(set, &entry.oid);
+		free_tree_buffer(tree);
 	} else if (obj->type == OBJ_COMMIT) {
 		struct commit *commit = (struct commit *) obj;
 		struct commit_list *parents = commit->parents;
diff --git a/revision.c b/revision.c
index 553c0faa9b..fac2577748 100644
--- a/revision.c
+++ b/revision.c
@@ -3271,8 +3271,15 @@ static int mark_uninteresting(const struct object_id *oid,
 			      void *cb)
 {
 	struct rev_info *revs = cb;
+	/*
+	 * yikes, do we really need to parse here? maybe
+	 * lookup_unknown_object() would be sufficient, or
+	 * even oid_object_info() followed by the correct type
+	 */
 	struct object *o = parse_object(revs->repo, oid);
 	o->flags |= UNINTERESTING | SEEN;
+	if (o->type == OBJ_TREE)
+		free_tree_buffer((struct tree *)o);
 	return 0;
 }
 

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rather slow 'git repack' in 'blob:none' partial clones
  2021-04-12 21:49       ` Bryan Turner
@ 2021-04-12 23:51         ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2021-04-12 23:51 UTC (permalink / raw)
  To: Bryan Turner
  Cc: SZEDER Gábor, Rafael Silva, Jonathan Tan, Git Mailing List

On Mon, Apr 12, 2021 at 02:49:00PM -0700, Bryan Turner wrote:

> I ran into this same surprising behavior recently, too. I was adding
> some automated testing to Bitbucket for partial clones and initially
> tried to use whether the repository was configured with a partial
> clone filter as one of my checks, only to find that even when filters
> weren't supported it was still set. The only way I could find to
> detect that a partial clone that was requested didn't actually happen
> was to parse the git clone output and look for the warning.

I think the state of "we have all the objects, but things are marked as
partial" is some place you could actually get into naturally: you start
with a partial clone, and then later fetch the objects you need, and you
just happen to have all the objects). Or you could even start there if
the filter happens not to exclude any objects. So I don't think that
state is invalid in any way.

But I do agree that if the client _knows_ that the filter was not used
(because the other side did not advertise filters and so we did not even
send it), then it is silly to create the client-side config marking us
as partial. It is misleading at best, and makes things slower at worst.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-04-12 23:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-03  9:04 rather slow 'git repack' in 'blob:none' partial clones SZEDER Gábor
2021-04-05  1:02 ` Rafael Silva
2021-04-07 21:17   ` Jeff King
2021-04-08  0:02     ` Jonathan Tan
2021-04-08  0:35       ` Jeff King
2021-04-12  7:09     ` Rafael Silva
2021-04-12 21:36     ` SZEDER Gábor
2021-04-12 21:49       ` Bryan Turner
2021-04-12 23:51         ` Jeff King
2021-04-12 23:47       ` Jeff King
2021-04-11 10:59   ` SZEDER Gábor
2021-04-12  7:53     ` Rafael Silva

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git