From: Taylor Blau <me@ttaylorr.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Jeff King <peff@peff.net>,
	Jonathan Tan <jonathantanmy@google.com>
Subject: Re: [PATCH 2/2] builtin/repack.c: implement support for `--cruft-max-size`
Date: Mon, 25 Sep 2023 14:01:19 -0400	[thread overview]
Message-ID: <ZRHK7+2Ta6U4CXuL@nand.local> (raw)
In-Reply-To: <xmqqtts5wnwn.fsf@gitster.g>

On Thu, Sep 07, 2023 at 04:42:48PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > This all works, but can be costly from an I/O-perspective when a
> > repository has either (a) many unreachable objects, (b) prunes objects
> > relatively infrequently/never, or (c) both.
>
> I can certainly understand (a).  If we need to write a lot of
> objects into the cruft pack, many of which are already in
> the previous generation of cruft pack, we would be copying bits from
> the old to the new cruft pack, and having to do so for many objects
> would involve expensive I/O.  But not (b).  Whether we prune objects
> infrequently or we do so very often, as long as the number and
> on-disk size of objects that have to go into cruft packs are the
> same, wouldn't the cost of I/O be pretty much the same?  IOW, (b) does
> not have much to do with how costly repacking is I/O-wise, except
> that it is a contributing factor that makes (a) worse, which is the
> real cause of the I/O cost.

Yeah, (b) on its own isn't enough to cost us a significant amount of I/O
overhead, but it definitely exacerbates (a) if we are repacking
relatively frequently. I'll clarify this in the patch notes.

> > This limits the I/O churn up to a quadratic function of the value
> > specified by the `--cruft-max-size` option, instead of behaving
> > quadratically in the number of total unreachable objects.
>
> ... I do not quite see how you would limit the I/O churn.

In the above (elided here for brevity) response, you are correct: the
idea is that once a cruft pack has grown to whatever threshold you set
via `--cruft-max-size`, it is effectively frozen, preventing you from
incurring the same I/O churn from it in the future.

> > When pruning unreachable objects, we bypass the new paths which combine
>
> "paths" here refers to...?  code paths, I guess?

Yep.

> > +gc.cruftMaxSize::
> > +	Limit the size of new cruft packs when repacking. When
> > +	specified in addition to `--cruft-max-size`, the command line
> > +	option takes priority. See the `--cruft-max-size` option of
> > +	linkgit:git-repack[1].
>
> Hmph.
>
> I am reasonably sure that I will mix the name up and call it
> gc.maxCruftSize in my configuration file and scratch my head
> wondering why it is not working.

I have no strong preference for either "gc.cruftMaxSize" or
"gc.maxCruftSize" (or anything else, really), so long as we are
consistent with the command-line option.

> > diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
> > index 90806fd26a..8a90d684a7 100644
> > --- a/Documentation/git-gc.txt
> > +++ b/Documentation/git-gc.txt
> > @@ -59,6 +59,13 @@ be performed as well.
> >  	cruft pack instead of storing them as loose objects. `--cruft`
> >  	is on by default.
> >
> > +--cruft-max-size=<n>::
> > +	When packing unreachable objects into a cruft pack, limit the
> > +	size of new cruft packs to be at most `<n>`. Overrides any
> > +	value specified via the `gc.cruftMaxSize` configuration. See
> > +	the `--cruft-max-size` option of linkgit:git-repack[1] for
> > +	more.
>
> At least on this side, giving --max-cruft-size=<n> (which I think is a
> lot more natural word order) would cause parse-options to give an
> error, so it won't risk mistakes going silently unnoticed.

Yeah, that's compelling. I'm convinced ;-).

> > diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
> > index 4017157949..23fd203d79 100644
> > --- a/Documentation/git-repack.txt
> > +++ b/Documentation/git-repack.txt
> > @@ -74,6 +74,15 @@ to the new separate pack will be written.
> >  	immediately instead of waiting for the next `git gc` invocation.
> >  	Only useful with `--cruft -d`.
> >
> > +--cruft-max-size=<n>::
> > +	Repack cruft objects into packs as large as `<n>` before
> > +	creating new packs. As long as there are enough cruft packs
> > +	smaller than `<n>`, repacking will cause a new cruft pack to
> > +	be created containing objects from any combined cruft packs,
> > +	along with any new unreachable objects. Cruft packs larger
> > +	than `<n>` will not be modified. Only useful with `--cruft
> > +	-d`.
>
> Here, the missing fourth point I pointed out above is mentioned,
> which is good.
>
> Describe the unit for <n> (I am assuming that is counted in bytes,
> honoring the human-friendly suffix, like 100M).

Yep, thanks.

> There may be some "interesting" behaviour around the size boundary,
> no?  If you pack too many objects, your resulting size may slightly
> bust <n> and you will get a complaint, but by fixing that "bug", you
> will always stop short of filling the whole <n> bytes in the
> produced packfiles, and they will not be exempt from rewriting
> (because they are not "larger than <n>"), which defeats the point of
> this patch.

Yeah, the boundary conditions are definitely the most interesting part
of this patch IMHO. When packing with `--max-cruft-size`, we still do
cap the pack size of the resulting pack, so any excess spills over into
the next pack.

> Describe that <n> is a threshold that we stop soon after passing to
> explicitly allow us to go beyond it would solve the above problem, I
> would presume.

I am not sure I understand what you're getting at here.

> > diff --git a/builtin/gc.c b/builtin/gc.c
> > index 1f53b66c7b..b6640abd35 100644
> > --- a/builtin/gc.c
> > +++ b/builtin/gc.c
> > @@ -52,6 +52,7 @@ static const char * const builtin_gc_usage[] = {
> >  static int pack_refs = 1;
> >  static int prune_reflogs = 1;
> >  static int cruft_packs = 1;
> > +static char *cruft_max_size;
>
> I do not think this type is a good idea.

Yeah, I marked this as a string because we don't ourselves do anything
with it in 'gc', and instead just immediately pass it down to 'repack'.
We could parse it ourselves and catch any malformed arguments earlier,
though, which sounds worthwhile to me.
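
Something like the following (an untested sketch, modulo whatever we end
up naming the option) is what I have in mind; both OPT_MAGNITUDE and
git_parse_ulong already understand human-friendly suffixes like "100M",
so a malformed value would be rejected in 'gc' itself:

    static unsigned long cruft_max_size;

    /* in cmd_gc()'s option list */
    OPT_MAGNITUDE(0, "cruft-max-size", &cruft_max_size,
                  N_("with --cruft, limit the size of new cruft packs")),

    /* in gc_config(), for the corresponding configuration knob */
    if (!strcmp(var, "gc.cruftmaxsize")) {
            if (!git_parse_ulong(value, &cruft_max_size))
                    return error(_("invalid value for '%s': '%s'"),
                                 var, value);
            return 0;
    }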

> > diff --git a/builtin/repack.c b/builtin/repack.c
> > index 44cb261371..56e7f5f43d 100644
> > --- a/builtin/repack.c
> > +++ b/builtin/repack.c
> > @@ -26,6 +26,9 @@
> >  #define LOOSEN_UNREACHABLE 2
> >  #define PACK_CRUFT 4
> >
> > +#define DELETE_PACK ((void*)(uintptr_t)1)
> > +#define RETAIN_PACK ((uintptr_t)(1<<1))
>
> Shouldn't these look more similar?  That is
>
> 	((void *)(uintptr_t)(1<<0))
> 	((void *)(uintptr_t)(1<<1))

Yeah, these have been shored up from some helpful review on the repack
internals cleanup series I posted earlier.

> > +		if (existing_cruft_nr >= existing->cruft_packs.nr)
> > +			BUG("too many cruft packs (found %"PRIuMAX", but knew "
> > +			    "of %"PRIuMAX")",
>
> Is that a BUG() that somehow miscounted the packs, or can it be a
> runtime error that may happen when a "git push" is pushing new
> objects into the repository, creating a new pack we did not know
> about?  Something like the latter should not be marked a BUG(), but

This would be the former. We load the set of packs at the beginning of a
repack operation from collect_pack_filenames() via a call to
get_all_packs(). So the set won't change between our view of it in
collect_pack_filenames() and collapse_small_cruft_packs(). IOW, we
should expect to see as many cruft packs as we saw in
collect_pack_filenames(), and any difference there would indeed be a
BUG().

> > +			    (uintmax_t)existing_cruft_nr + 1,
> > +			    (uintmax_t)existing->cruft_packs.nr);
> > +		existing_cruft[existing_cruft_nr++] = p;
> > +	}
> > +
> > +	QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
>
> We use the simplest "from smaller ones to larger ones, combine one
> by one together until the result gets large enough", which would not
> give us the best packing, but it is OK because it is not our goal to
> solve knapsack problem here, I presume?

Exactly. Whatever spill-over we generate (if any) will be the seed of
the next "generation" of cruft packs, which are allowed to grow and
accumulate until they themselves reach the size threshold, at which
point the process starts itself over again.
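
To make that concrete with made-up numbers: with a 100M threshold and
existing cruft packs of 30M, 35M, and 60M, the 30M and 35M packs get
combined (65M so far) along with any new unreachable objects, while the
60M pack is retained, since adding it would push us past 100M. If the
resulting pack busts the 100M cap, the excess spills into a fresh pack,
and that fresh pack seeds the next generation.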

> > +	for (i = 0; i < existing_cruft_nr; i++) {
> > +		off_t proposed;
> > +
> > +		p = existing_cruft[i];
> > +		proposed = st_add(total_size, p->pack_size);
> > +
> > +		if (proposed <= max_size) {
> > +			total_size = proposed;
> > +			fprintf(in, "-%s\n", pack_basename(p));
> > +		} else {
> > +			retain_cruft_pack(existing, p);
> > +			fprintf(in, "%s\n", pack_basename(p));
> > +		}
> > +	}
>
> This is exactly what I talked about the possibly funny behaviour
> around the boundary earlier, but it may be even worse.  This time,
> we may decide that a pack with size <n-epsilon> is to be retained,
> only because the pack that came next in the existing list happened
> to be larger than epsilon, but next time around, it may not be the
> case (i.e. our pack may be smaller than epsilon, the next one in the
> current round may be larger than epsilon, but before we repack the
> next time, a new pack that is slightly smaller than epsilon that is
> larger than our pack may have been created---now our pack will be
> combined with it), so the algorithm to choose which ones are kept
> does not depend on the pack itself alone but also depends on its
> surroundings.

If I understand your comment correctly, that behavior is as-designed. We
try to grow cruft packs by combining other cruft packs that are not yet
frozen, and I think that is going to be dependent on the entire set of
packs rather than the characteristic of any one pack.

An alternative approach might be to combine *all* cruft packs smaller
than some threshold, and let any spill-over get handled by pack-objects
with its --max-pack-size option. I do not have a strong preference
between the two.
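
Roughly, that alternative would look something like this (sketch only,
reusing the names from the hunk quoted above), with `--max-pack-size=<n>`
passed down to pack-objects so that it splits any spill-over for us:

    for (i = 0; i < existing_cruft_nr; i++) {
            p = existing_cruft[i];
            if (p->pack_size < max_size) {
                    /* below the threshold: roll its objects into the new pack */
                    fprintf(in, "-%s\n", pack_basename(p));
            } else {
                    /* at or above the threshold: freeze it as-is */
                    retain_cruft_pack(existing, p);
                    fprintf(in, "%s\n", pack_basename(p));
            }
    }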

Thanks,
Taylor

Thread overview: 21+ messages
2023-09-07 21:51 [PATCH 0/2] repack: implement `--cruft-max-size` Taylor Blau
2023-09-07 21:52 ` [PATCH 1/2] t7700: split cruft-related tests to t7704 Taylor Blau
2023-09-08  0:01   ` Eric Sunshine
2023-09-07 21:52 ` [PATCH 2/2] builtin/repack.c: implement support for `--cruft-max-size` Taylor Blau
2023-09-07 23:42   ` Junio C Hamano
2023-09-25 18:01     ` Taylor Blau [this message]
2023-09-08 11:21   ` Patrick Steinhardt
2023-10-02 20:30     ` Taylor Blau
2023-10-03  0:44 ` [PATCH v2 0/3] repack: implement `--cruft-max-size` Taylor Blau
2023-10-03  0:44   ` [PATCH v2 1/3] t7700: split cruft-related tests to t7704 Taylor Blau
2023-10-03  0:44   ` [PATCH v2 2/3] builtin/repack.c: parse `--max-pack-size` with OPT_MAGNITUDE Taylor Blau
2023-10-05 11:31     ` Patrick Steinhardt
2023-10-05 17:28       ` Taylor Blau
2023-10-05 20:22         ` Junio C Hamano
2023-10-03  0:44   ` [PATCH v2 3/3] builtin/repack.c: implement support for `--max-cruft-size` Taylor Blau
2023-10-05 12:08     ` Patrick Steinhardt
2023-10-05 17:35       ` Taylor Blau
2023-10-05 20:25       ` Junio C Hamano
2023-10-07 17:20     ` [PATCH] repack: free existing_cruft array after use Jeff King
2023-10-09  1:24       ` Taylor Blau
2023-10-09 17:28         ` Junio C Hamano
