git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/2] speed up "Counting objects" when there are many packs
@ 2016-07-25 18:49 Jeff King
  2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King
  2016-07-25 18:50 ` [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early Jeff King
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2016-07-25 18:49 UTC (permalink / raw)
  To: git

We sometimes see cases at GitHub where repository maintenance has fallen
behind, and you get a large number of packs. The solution is to repack,
but that process is itself made a lot slower by the number of packs.

We've experimented a bit with fast "just cat all the packfiles together"
type approaches, but they have some downsides, so I have nothing to show
there yet.

However, there are a few easy optimizations we can do to cut out some
unnecessary computation in common cases (e.g., when you have no .keep
files and when you have no upstream alternates storage). Both of these
patches have been in production at GitHub for about 6 months.

  [1/2]: pack-objects: break out of want_object loop early
  [2/2]: pack-objects: compute local/ignore_pack_keep early

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-25 18:49 [PATCH 0/2] speed up "Counting objects" when there are many packs Jeff King
@ 2016-07-25 18:50 ` Jeff King
  2016-07-25 19:56   ` Junio C Hamano
  2016-07-25 18:50 ` [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-07-25 18:50 UTC (permalink / raw)
  To: git

When pack-objects collects the list of objects to pack
(either from stdin, or via its internal rev-list), it
filters each one through want_object_in_pack().

This function loops through each existing packfile, looking
for the object. When we find it, we mark the pack/offset
combo for later use. However, we can't just return "yes, we
want it" at that point. If --honor-pack-keep is in effect,
we must keep looking to find it in _all_ packs, to make sure
none of them has a .keep. Likewise, if --local is in effect,
we must make sure it is not present in any local pack.

As a result, the sum effort of these calls is effectively
O(nr_objects * nr_packs). In an ordinary repository, we have
only a handful of packs, and this doesn't make a big
difference. But in pathological cases, it can slow the
counting phase to a crawl.

This patch notices the case that we have neither "--local"
nor "--honor-pack-keep" in effect and breaks out of the loop
early, after finding the first instance. Note that our worst
case is still "objects * packs" (i.e., we might find each
object in the last pack we look in), but in practice we will
often break out early. On an "average" repo, my git.git with
8 packs, this shows a modest 2% (a few dozen milliseconds)
improvement in the counting-objects phase of "git
pack-objects --all <foo" (hackily instrumented by sticking
exit(0) right after list_objects).

But in a much more pathological case, it makes a bigger
difference. I ran the same command on a real-world example
with ~9 million objects across 1300 packs. The counting time
dropped from 413s to 45s, an improvement of about 89%.

Note that this patch won't do anything by itself for a
normal "git gc", as it uses both --honor-pack-keep and
--local.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..55ef5a8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -981,6 +981,8 @@ static int want_object_in_pack(const unsigned char *sha1,
 				return 0;
 			if (ignore_packed_keep && p->pack_local && p->pack_keep)
 				return 0;
+			if (!ignore_packed_keep && !local)
+				break;
 		}
 	}
 
-- 
2.9.2.512.g8a06708


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early
  2016-07-25 18:49 [PATCH 0/2] speed up "Counting objects" when there are many packs Jeff King
  2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King
@ 2016-07-25 18:50 ` Jeff King
  1 sibling, 0 replies; 13+ messages in thread
From: Jeff King @ 2016-07-25 18:50 UTC (permalink / raw)
  To: git

In want_object_in_pack(), we can exit early from our loop if
neither "local" nor "ignore_pack_keep" are set. If they are,
however, we must examine each pack to see if it has the
object and is non-local or has a ".keep".

It's quite common for there to be no non-local or .keep
packs at all, in which case we know ahead of time that
looking further will be pointless. We can pre-compute this
by simply iterating over the list of packs ahead of time,
and dropping the flags if there are no packs that could
match.

Another similar strategy would be to modify the loop in
want_object_in_pack() to notice that we have already found
the object once, and that we are looping only to check for
"local" and "keep" attributes. If a pack has neither of
those, we can skip the call to find_pack_entry_one(), which
is the expensive part of the loop.

This has two advantages:

  - it isn't all-or-nothing; we still get some improvement
    when there's a small number of kept or non-local packs,
    and a large number of non-kept local packs

  - it eliminates any possible race where we add new
    non-local or kept packs after our initial scan. In
    practice, I don't think this race matters; we already
    cache the packed_git information, so somebody who adds a
    new pack or .keep file after we've started will not be
    noticed at all, unless we happen to need to call
    reprepare_packed_git() because a lookup fails.

    In other words, we're already racy, and the race is not
    a big deal (losing the race means we might include an
    object in the pack that would not otherwise be, which is
    an acceptable outcome).

However, it also has a disadvantage: we still loop over the
rest of the packs for each object to check their flags. This
is much less expensive than doing the object lookup, but
still not free. So if we wanted to implement that strategy
to cover the non-all-or-nothing cases, we could do so in
addition to this one (so you get the most speedup in the
all-or-nothing case, and the best we can do in the other
cases). But given that the all-or-nothing case is likely the
most common, it is probably not worth the trouble, and we
can revisit this later if evidence points otherwise.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 55ef5a8..c5e343c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -46,6 +46,7 @@ static int keep_unreachable, unpack_unreachable, include_tag;
 static unsigned long unpack_unreachable_expiration;
 static int pack_loose_unreachable;
 static int local;
+static int have_non_local_packs;
 static int incremental;
 static int ignore_packed_keep;
 static int allow_ofs_delta;
@@ -981,7 +982,7 @@ static int want_object_in_pack(const unsigned char *sha1,
 				return 0;
 			if (ignore_packed_keep && p->pack_local && p->pack_keep)
 				return 0;
-			if (!ignore_packed_keep && !local)
+			if (!ignore_packed_keep && (!local || !have_non_local_packs))
 				break;
 		}
 	}
@@ -2785,6 +2786,28 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		progress = 2;
 
 	prepare_packed_git();
+	if (ignore_packed_keep) {
+		struct packed_git *p;
+		for (p = packed_git; p; p = p->next)
+			if (p->pack_local && p->pack_keep)
+				break;
+		if (!p) /* no keep-able packs found */
+			ignore_packed_keep = 0;
+	}
+	if (local) {
+		/*
+		 * unlike ignore_packed_keep above, we do not want to
+		 * unset "local" based on looking at packs, as it
+		 * also covers non-local objects
+		 */
+		struct packed_git *p;
+		for (p = packed_git; p; p = p->next) {
+			if (!p->pack_local) {
+				have_non_local_packs = 1;
+				break;
+			}
+		}
+	}
 
 	if (progress)
 		progress_state = start_progress(_("Counting objects"), 0);
-- 
2.9.2.512.g8a06708

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King
@ 2016-07-25 19:56   ` Junio C Hamano
  2016-07-25 21:41     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-07-25 19:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> When pack-objects collects the list of objects to pack
> (either from stdin, or via its internal rev-list), it
> filters each one through want_object_in_pack().
>
> This function loops through each existing packfile, looking
> for the object. When we find it, we mark the pack/offset
> combo for later use. However, we can't just return "yes, we
> want it" at that point. If --honor-pack-keep is in effect,
> we must keep looking to find it in _all_ packs, to make sure
> none of them has a .keep. Likewise, if --local is in effect,
> we must make sure it is not present in any local pack.

s/any local pack/any non-local pack/, no?

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a2f8cfd..55ef5a8 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -981,6 +981,8 @@ static int want_object_in_pack(const unsigned char *sha1,
>  				return 0;
>  			if (ignore_packed_keep && p->pack_local && p->pack_keep)
>  				return 0;
> +			if (!ignore_packed_keep && !local)
> +				break;
>  		}
>  	}

OK, so in this loop, we may return "false" (meaning, we do not want
to pack the object) if "local" (do not pack objects that appear in
non-local packs) or "ignore_packed_keep" (do not pack objects that
appear in locally kept packs) are in effect, but if neither of the
options is set, we know that one of the preconditions ("local" or
"ignore_packed_keep") for these two "reject by returning false" if
statements would never trigger for any pack on packed_git list, so
it is safe to break out and return the one that we have found.

If that is what is going on, I would have expected to see this early
break before these "we found that this is available in borrowed pack
and we are only packing local" and "we ignore objects in locally
kept packs" checks return false.

Or am I not following the logic in the loop correctly?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-25 19:56   ` Junio C Hamano
@ 2016-07-25 21:41     ` Jeff King
  2016-07-25 21:52       ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-07-25 21:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Jul 25, 2016 at 12:56:23PM -0700, Junio C Hamano wrote:

> > This function loops through each existing packfile, looking
> > for the object. When we find it, we mark the pack/offset
> > combo for later use. However, we can't just return "yes, we
> > want it" at that point. If --honor-pack-keep is in effect,
> > we must keep looking to find it in _all_ packs, to make sure
> > none of them has a .keep. Likewise, if --local is in effect,
> > we must make sure it is not present in any local pack.
> 
> s/any local pack/any non-local pack/, no?

Oops, yeah.

> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index a2f8cfd..55ef5a8 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -981,6 +981,8 @@ static int want_object_in_pack(const unsigned char *sha1,
> >  				return 0;
> >  			if (ignore_packed_keep && p->pack_local && p->pack_keep)
> >  				return 0;
> > +			if (!ignore_packed_keep && !local)
> > +				break;
> >  		}
> >  	}
> 
> OK, so in this loop, we may return "false" (meaning, we do not want
> to pack the object) if "local" (do not pack objects that appear in
> non-local packs) or "ignore_packed_keep" (do not pack objects that
> appear in locally kept packs) are in effect, but if neither of the
> options is set, we know that one of the preconditions ("local" or
> "ignore_packed_keep") for these two "reject by returning false" if
> statements would never trigger for any pack on packed_git list, so
> it is safe to break out and return the one that we have found.

Correct.

> If that is what is going on, I would have expected to see this early
> break before these "we found that this is available in borrowed pack
> and we are only packing local" and "we ignore objects in locally
> kept packs" checks return false.
> 
> Or am I not following the logic in the loop correctly?

Yeah, I think that would work. It has to come after "did we find this in
the pack", obviously. And it has to come after the other unrelated
checks ("are we just finding it to exclude?" and "are we
incremental?"). But you could do:

  if (!*found_pack) {
    ... first find! fill in found pack, etc ...
  }
  if (exclude)
	return 1;
  if (incremental)
	return 0;
  if (!ignore_packed_keep && !local)
	break; /* effectively return 1, but I think the break is more clear */
  if (local && !p->pack_local)
	return 0;
  if (ignore_packed_keep && p->pack_local && p->pack_keep)
	return 0;

which just bumps it up. I don't think there is a way to make it more
elegant, e.g., by only checking ignore_packed_keep once, because we have
to distinguish between each condition being set independently, or the
case where neither is set.

So I stuck the new check at the end, because to me logically it was "can
we break out of the loop instead of looking at p->next". But I agree it
would be equivalent to place it before the related checks, and I don't
mind doing that if you think it's more readable.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-25 21:41     ` Jeff King
@ 2016-07-25 21:52       ` Junio C Hamano
  2016-07-25 22:14         ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-07-25 21:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

>   if (!*found_pack) {
>     ... first find! fill in found pack, etc ...
>   }
>   if (exclude)
> 	return 1;
>   if (incremental)
> 	return 0;
>   if (!ignore_packed_keep && !local)
> 	break; /* effectively return 1, but I think the break is more clear */
>   if (local && !p->pack_local)
> 	return 0;
>   if (ignore_packed_keep && p->pack_local && p->pack_keep)
> 	return 0;
>
> which just bumps it up. I don't think there is a way to make it more
> elegant, e.g., by only checking ignore_packed_keep once, because we have
> to distinguish between each condition being set independently, or the
> case where neither is set.
>
> So I stuck the new check at the end, because to me logically it was "can
> we break out of the loop instead of looking at p->next". But I agree it
> would be equivalent to place it before the related checks, and I don't
> mind doing that if you think it's more readable.

I do not mind too much about having to check two bools twice.  But
given that the reason why I was confused was because I didn't see
why we need to pass the two "return 0" conditions at least once
before we decide that we do not need the "return 0" thing at all,
and started constructing a case where this might break by writing
"Suppose you have two packs, one remote and one local in packed_git
list in this order, and ..." before I realized that the new "early
break" can be hoisted up like the above, I definitely feel that "we
found one, and we aren't conditionally pretending that this thing
does not need to be packed at all, so return early and say we want
to pack it" is easier to understand before the two existing "if"
statements.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-25 21:52       ` Junio C Hamano
@ 2016-07-25 22:14         ` Jeff King
  2016-07-26 20:38           ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-07-25 22:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Jul 25, 2016 at 02:52:24PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   if (!*found_pack) {
> >     ... first find! fill in found pack, etc ...
> >   }
> >   if (exclude)
> > 	return 1;
> >   if (incremental)
> > 	return 0;
> >   if (!ignore_packed_keep && !local)
> > 	break; /* effectively return 1, but I think the break is more clear */
> >   if (local && !p->pack_local)
> > 	return 0;
> >   if (ignore_packed_keep && p->pack_local && p->pack_keep)
> > 	return 0;
> >
> > which just bumps it up. I don't think there is a way to make it more
> > elegant, e.g., by only checking ignore_packed_keep once, because we have
> > to distinguish between each condition being set independently, or the
> > case where neither is set.
> >
> > So I stuck the new check at the end, because to me logically it was "can
> > we break out of the loop instead of looking at p->next". But I agree it
> > would be equivalent to place it before the related checks, and I don't
> > mind doing that if you think it's more readable.
> 
> I do not mind too much about having to check two bools twice.  But
> given that the reason why I was confused was because I didn't see
> why we need to pass the two "return 0" conditions at least once
> before we decide that we do not need the "return 0" thing at all,
> and started constructing a case where this might break by writing
> "Suppose you have two packs, one remote and one local in packed_git
> list in this order, and ..." before I realized that the new "early
> break" can be hoisted up like the above, I definitely feel that "we
> found one, and we aren't conditionally pretending that this thing
> does not need to be packed at all, so return early and say we want
> to pack it" is easier to understand before the two existing "if"
> statements.

Ah, right. Now you had me second-guessing for a moment that there might
be a bad case in hoisting it up where we would want to return 0 but
would break out early to the "return 1".

But it cannot be the case, because the break is mutually exclusive with
the two conditions.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-25 22:14         ` Jeff King
@ 2016-07-26 20:38           ` Junio C Hamano
  2016-07-26 20:48             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-07-26 20:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

>> I do not mind too much about having to check two bools twice.  But
>> given that the reason why I was confused was because I didn't see
>> why we need to pass the two "return 0" conditions at least once
>> before we decide that we do not need the "return 0" thing at all,
>> and started constructing a case where this might break by writing
>> "Suppose you have two packs, one remote and one local in packed_git
>> list in this order, and ..." before I realized that the new "early
>> break" can be hoisted up like the above, I definitely feel that "we
>> found one, and we aren't conditionally pretending that this thing
>> does not need to be packed at all, so return early and say we want
>> to pack it" is easier to understand before the two existing "if"
>> statements.
>
> Ah, right. Now you had me second-guessing for a moment that there might
> be a bad case in hoisting it up where we would want to return 0 but
> would break out early to the "return 1".
>
> But it cannot be the case, because the break is mutually exclusive with
> the two conditions.

Here is what I amended looks like (with s/local/non-local/ in the
log message).

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..a46bf5b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -977,6 +977,21 @@ static int want_object_in_pack(const unsigned char *sha1,
 				return 1;
 			if (incremental)
 				return 0;
+
+			/*
+			 * When asked to do --local (do not include an
+			 * object that appears in a pack we borrow
+			 * from elsewhere) or --honor-pack-keep (do not
+			 * include an object that appears in a pack marked
+			 * with .keep), we need to make sure no copy of this
+			 * object come from in _any_ pack that causes us to
+			 * omit it, and need to complete this loop.  When
+			 * neither option is in effect, we know the object
+			 * we just found is going to be packed, so break
+			 * out of the loop to return 1 now.
+			 */
+			if (!ignore_packed_keep && !local)
+				break;
 			if (local && !p->pack_local)
 				return 0;
 			if (ignore_packed_keep && p->pack_local && p->pack_keep)

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-26 20:38           ` Junio C Hamano
@ 2016-07-26 20:48             ` Jeff King
  2016-07-26 21:38               ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-07-26 20:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Jul 26, 2016 at 01:38:47PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> I do not mind too much about having to check two bools twice.  But
> >> given that the reason why I was confused was because I didn't see
> >> why we need to pass the two "return 0" conditions at least once
> >> before we decide that we do not need the "return 0" thing at all,
> >> and started constructing a case where this might break by writing
> >> "Suppose you have two packs, one remote and one local in packed_git
> >> list in this order, and ..." before I realized that the new "early
> >> break" can be hoisted up like the above, I definitely feel that "we
> >> found one, and we aren't conditionally pretending that this thing
> >> does not need to be packed at all, so return early and say we want
> >> to pack it" is easier to understand before the two existing "if"
> >> statements.
> >
> > Ah, right. Now you had me second-guessing for a moment that there might
> > be a bad case in hoisting it up where we would want to return 0 but
> > would break out early to the "return 1".
> >
> > But it cannot be the case, because the break is mutually exclusive with
> > the two conditions.
> 
> Here is what I amended looks like (with s/local/non-local/ in the
> log message).

Thanks, I was actually just preparing a very similar patch (to move the
condition and to add a comment, since clearly it is tricky).

I got side-tracked by adding a t/perf test to show off the improvement.
It's rather tricky to get right and takes a long time to run. I _think_
I have it now, but am waiting for results. :)

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a2f8cfd..a46bf5b 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -977,6 +977,21 @@ static int want_object_in_pack(const unsigned char *sha1,
>  				return 1;
>  			if (incremental)
>  				return 0;
> +
> +			/*
> +			 * When asked to do --local (do not include an
> +			 * object that appears in a pack we borrow
> +			 * from elsewhere) or --honor-pack-keep (do not
> +			 * include an object that appears in a pack marked
> +			 * with .keep), we need to make sure no copy of this
> +			 * object come from in _any_ pack that causes us to
> +			 * omit it, and need to complete this loop.  When
> +			 * neither option is in effect, we know the object
> +			 * we just found is going to be packed, so break
> +			 * out of the loop to return 1 now.
> +			 */
> +			if (!ignore_packed_keep && !local)
> +				break;

This looks great. Given the explanation in the comment, it might be more
clear to switch the break to "return 1", but I could go either way.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-26 20:48             ` Jeff King
@ 2016-07-26 21:38               ` Junio C Hamano
  2016-07-27 21:13                 ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-07-26 21:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I got side-tracked by adding a t/perf test to show off the improvement.
> It's rather tricky to get right and takes a long time to run. I _think_
> I have it now, but am waiting for results. :)

Well, then I'd stop here and wait for the reroll to requeue.

Thanks.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-26 21:38               ` Junio C Hamano
@ 2016-07-27 21:13                 ` Jeff King
  2016-07-27 21:28                   ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2016-07-27 21:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Jul 26, 2016 at 02:38:28PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I got side-tracked by adding a t/perf test to show off the improvement.
> > It's rather tricky to get right and takes a long time to run. I _think_
> > I have it now, but am waiting for results. :)
> 
> Well, then I'd stop here and wait for the reroll to requeue.

So I hoped to follow up with my little perf-testing patch last night,
but the rabbit hole goes deeper. :)

It turns out that I could not replicate my earlier results using a perf
script like the one below. The problem is that it heavily depends on the
ordering of your packs. The early-break does not do any good if you
mostly end up finding your objects in the final pack in the list. And
that is the exact situation my perf script creates: imagine you had a
big repository, then 1000 pushes came in, and then you tried to run "git
repack". We sort reverse-chronologically, so the big pack is at the end,
and most lookups have to go through the entire pack list to get there.

In the normal object-lookup paths, we cache the last_found_pack and
check it first, but this loop has no equivalent (and until now it
wouldn't have helped much, because we generally had to walk the whole
list anyway).

I knew this was a potential issue and even had experimented this back
when the original patches were written, but in my experiments then it
didn't help much. That makes sense, though; any case that _was_ improved
by the first two patches would not benefit from reordering the pack
lookups, because that meant it was finding the objects early in the
search (or else the first two patches would have helped not at all).

And I think the particular case I was experimenting on back then was not
a normal "oops, we had some pushes and forgot to run gc". It was more
bizarre, and had several packs that had most of the objects duplicated.

So I think this is an area worth looking into; a series of small
incremental packs on top of a large one is what I would expect to be
the most common case, and the first two patches don't handle it well.

I tried a hacky version of the last_found_pack trick, and it does cut
the counting phase in half for my perf test. But no patch yet. One is
just that doing it cleanly is a little tricky. But two is that I've
wondered if we can do even better with a most-recently-used cache
instead of the last_pack_found hack. So I'm trying to implement and
measure that (both for this loop, and to see if it does better in
find_pack_entry).

Perf script below is just for a sneak peek at what I'm trying to
measure.

-Peff

-- >8 --
#!/bin/sh

test_description='performance with large numbers of packs'
. ./perf-lib.sh

test_perf_large_repo

# Pretend we just have a single branch and no reflogs, and that everything is
# in objects/pack; that makes our fake pack-building in the next step much
# simpler.
test_expect_success 'simplify reachability' '
	tip=$(git rev-parse --verify HEAD) &&
	git for-each-ref --format="option no-deref%0adelete %(refname)" |
	git update-ref --stdin &&
	rm -rf .git/logs &&
	git update-ref refs/heads/master $tip &&
	git symbolic-ref HEAD refs/heads/master &&
	git repack -ad
'

# A real many-pack situation would probably come from having a lot of pushes
# over time. We don't know how big each push would be, but we can fake it by
# just walking the first-parent chain and having each commit be its own "push".
# This isn't _entirely_ accurate, as real pushes would have some duplicate
# objects due to thin-pack fixing, but it's a reasonable approximation.
#
# And then all of the rest of the objects can go in a single packfile that
# represents the state before any of those pushes (actually, we'll generate
# that first because in such a setup it would be the oldest pack, and we sort
# the packs by reverse mtime inside git).
#
# We prepare this in a staging area, because we need to install our baseline
# set of packs for each iteration of the perf test (which unfortunately counts
# against their times, but is a limitation of the perf framework).
test_expect_success 'create a large number of packs' '
	mkdir staging &&

	pushes() {
		git rev-list --first-parent -1000 HEAD
	} &&
	pack() {
		git pack-objects --revs --delta-base-offset staging/pack
	} &&

	bottom=$(pushes | tail -n 1) &&
	echo "$bottom^" | pack &&

	pushes |
	while read rev
	do
		printf "%s\n^%s^" $rev $rev | pack ||
			return 1
	done &&

	setup_many_packs () {
		rm -f .git/objects/pack/* &&
		cp staging/* .git/objects/pack/
	} &&
	setup_many_packs
'

test_perf 'rev-list' '
	git rev-list --objects --all
'

test_perf 'full repack' '
	setup_many_packs &&
	git repack -ad
'

test_done

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-27 21:13                 ` Jeff King
@ 2016-07-27 21:28                   ` Junio C Hamano
  2016-07-27 22:04                     ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2016-07-27 21:28 UTC (permalink / raw)
  To: Jeff King; +Cc: Git Mailing List

On Wed, Jul 27, 2016 at 2:13 PM, Jeff King <peff@peff.net> wrote:
> ... But two is that I've
> wondered if we can do even better with a most-recently-used cache
> instead of the last_pack_found hack. So I'm trying to implement and
> measure that (both for this loop, and to see if it does better in
> find_pack_entry).

It is always delightful to hear a well constructed description of a
thought process. Thanks.

One thing that made me wonder was what would happen to the
last_found that is static to has_sha1_pack_kept_or_nonlocal()
funciton, when we invalidate the packed_git list, but within the
context of pack-objects it is not likely?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] pack-objects: break out of want_object loop early
  2016-07-27 21:28                   ` Junio C Hamano
@ 2016-07-27 22:04                     ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2016-07-27 22:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Wed, Jul 27, 2016 at 02:28:32PM -0700, Junio C Hamano wrote:

> On Wed, Jul 27, 2016 at 2:13 PM, Jeff King <peff@peff.net> wrote:
> > ... But two is that I've
> > wondered if we can do even better with a most-recently-used cache
> > instead of the last_pack_found hack. So I'm trying to implement and
> > measure that (both for this loop, and to see if it does better in
> > find_pack_entry).
> 
> It is always delightful to hear a well constructed description of a
> thought process. Thanks.
> 
> One thing that made me wonder was what would happen to the
> last_found that is static to has_sha1_pack_kept_or_nonlocal()
> funciton, when we invalidate the packed_git list, but within the
> context of pack-objects it is not likely?

I wondered that, too, and that's part of what I'm working on. :)

We leave the old packed_git structs in place when we re-read the pack
directory. That works because either we have them already opened (so
even though they're gone, we can still access them, and that's why we
_must_ keep the structs in place), or we will call is_pack_valid() and
notice that it went away (and quietly skip to checking the next pack).

There is one place where we do free the packed_git, but I actually think
we can stop doing so. Here's what I'm planning as part of my series:


-- >8 --
Subject: [PATCH] sha1_file: drop free_pack_by_name

The point of this function is to drop an entry from the
"packed_git" cache that points to a file we might be
overwriting, because our contents may not be the same (and
hence the only caller was pack-objects as it moved a
temporary packfile into place).

In older versions of git, this could happen because the
names of packfiles were derived from the set of objects they
contained, not the actual bits on disk. But since 1190a1a
(pack-objects: name pack files after trailer hash,
2013-12-05), the name reflects the actual bits on disk, and
any two packfiles with the same name can be used
interchangeably.

Dropping this function not only saves a few lines of code,
it makes the lifetime of "struct packed_git" much easier to
reason about: namely, we now do not ever free these structs.

Signed-off-by: Jeff King <peff@peff.net>
---
I won't be surprised if this fixes some obscure bug, because there may
be parts of the code that store a pointer to our packed_git that could
be invalidated (I'm certain the bitmap code does). But perhaps it
doesn't matter because it's rare for this function to trigger at all.

 cache.h      |  1 -
 pack-write.c |  1 -
 sha1_file.c  | 30 ------------------------------
 3 files changed, 32 deletions(-)

diff --git a/cache.h b/cache.h
index 3855ddf..57ef726 100644
--- a/cache.h
+++ b/cache.h
@@ -1416,7 +1416,6 @@ extern unsigned char *use_pack(struct packed_git *, struct pack_window **, off_t
 extern void close_pack_windows(struct packed_git *);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
-extern void free_pack_by_name(const char *);
 extern void clear_delta_base_cache(void);
 extern struct packed_git *add_packed_git(const char *path, size_t path_len, int local);
 
diff --git a/pack-write.c b/pack-write.c
index 33293ce..ea0b788 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -354,7 +354,6 @@ void finish_tmp_packfile(struct strbuf *name_buffer,
 		die_errno("unable to make temporary index file readable");
 
 	strbuf_addf(name_buffer, "%s.pack", sha1_to_hex(sha1));
-	free_pack_by_name(name_buffer->buf);
 
 	if (rename(pack_tmp_name, name_buffer->buf))
 		die_errno("unable to rename temporary pack file");
diff --git a/sha1_file.c b/sha1_file.c
index d5e1121..e045d2f 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -891,36 +891,6 @@ void close_pack_index(struct packed_git *p)
 	}
 }
 
-/*
- * This is used by git-repack in case a newly created pack happens to
- * contain the same set of objects as an existing one.  In that case
- * the resulting file might be different even if its name would be the
- * same.  It is best to close any reference to the old pack before it is
- * replaced on disk.  Of course no index pointers or windows for given pack
- * must subsist at this point.  If ever objects from this pack are requested
- * again, the new version of the pack will be reinitialized through
- * reprepare_packed_git().
- */
-void free_pack_by_name(const char *pack_name)
-{
-	struct packed_git *p, **pp = &packed_git;
-
-	while (*pp) {
-		p = *pp;
-		if (strcmp(pack_name, p->pack_name) == 0) {
-			clear_delta_base_cache();
-			close_pack(p);
-			free(p->bad_object_sha1);
-			*pp = p->next;
-			if (last_found_pack == p)
-				last_found_pack = NULL;
-			free(p);
-			return;
-		}
-		pp = &p->next;
-	}
-}
-
 static unsigned int get_max_fd_limit(void)
 {
 #ifdef RLIMIT_NOFILE
-- 
2.9.2.607.g98dce7b


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-07-27 22:04 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-25 18:49 [PATCH 0/2] speed up "Counting objects" when there are many packs Jeff King
2016-07-25 18:50 ` [PATCH 1/2] pack-objects: break out of want_object loop early Jeff King
2016-07-25 19:56   ` Junio C Hamano
2016-07-25 21:41     ` Jeff King
2016-07-25 21:52       ` Junio C Hamano
2016-07-25 22:14         ` Jeff King
2016-07-26 20:38           ` Junio C Hamano
2016-07-26 20:48             ` Jeff King
2016-07-26 21:38               ` Junio C Hamano
2016-07-27 21:13                 ` Jeff King
2016-07-27 21:28                   ` Junio C Hamano
2016-07-27 22:04                     ` Jeff King
2016-07-25 18:50 ` [PATCH 2/2] pack-objects: compute local/ignore_pack_keep early Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).