git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, peff@peff.net, dstolee@microsoft.com
Subject: Re: [PATCH v4 8/8] builtin/repack.c: add '--geometric' option
Date: Thu, 4 Mar 2021 16:55:19 -0500	[thread overview]
Message-ID: <YEFXRwyMpyXHgArH@nand.local> (raw)
In-Reply-To: <xmqqv9ahxddp.fsf@gitster.g>

On Wed, Feb 24, 2021 at 03:19:30PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
> > ..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
> > With a geometric factor of 'r', it should be that:
> >
> >   objects(Pi) > r*objects(P(i-1))
> >
> > for all i in [1, n], where the packs are sorted by
> >
> >   objects(P1) <= objects(P2) <= ... <= objects(Pn).
> >
> > Since finding a true optimal repacking is NP-hard, we approximate it
> > along two directions:
> >
> >   1. We assume that there is a cutoff of packs _before starting the
> >      repack_ where everything to the right of that cut-off already forms
> >      a geometric progression (or no cutoff exists and everything must be
> >      repacked).
>
> When you order existing packs like how you explained the next
> "direction" below, do we assume loose ones would sit before
> (i.e. "newer and smaller" than) all of the packs?

Kind of. We don't consider them to be part of any pack when deciding
where to place the split (in other words, we don't consider them at
all until the subsequent repack by which time they are packed).

That's a fine assumption to make (as you note in the reply below this
one), since we'll eventually reach a geometric progression. This
approximation can be as wrong as there are loose objects (but hopefully
there aren't so many by the time we want to do a geometric repack).

> >   2. We assume that everything smaller than the cutoff count must be
> >      repacked. This forms our base assumption, but it can also cause
> >      even the "heavy" packs to get repacked, for e.g., if we have 6
> >      packs containing the following number of objects:
> >
> >        1, 1, 1, 2, 4, 32
> >
> >      then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
> >      rolling up the first two packs into a pack with 2 objects. That
> >      breaks our progression and leaves us:
> >
> >        2, 1, 2, 4, 32
> >          ^
> >
> >      (where the '^' indicates the position of our split). To restore a
> >      progression, we move the split forward (towards larger packs)
> >      joining each pack into our new pack until a geometric progression
> >      is restored. Here, that looks like:
> >
> >        2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
> >          ^                   ^                ^                   ^
>
> This explanation is very intuitive and easy to understand (I assume
> we aren't actually repacking 1+1 into 2 and then 2+1 into 3 and then
> choosing to repack 3+2 to create 5, but we scan before doing any
> repacking and decide to repack 2+1+2+4 into a single 9).

Correct, and thanks. The split is determined ahead of time before we
actually get to writing any new packs.

> What is not so clear is how this picture changes depending on the
> value of 'r'.

It only means that subsequent packs need to contain at least 'r' times
as many objects as the previous pack does.

> > +static void split_pack_geometry(struct pack_geometry *geometry, int factor)
> > +{
> > +	uint32_t i;
> > +	uint32_t split;
> > +	off_t total_size = 0;
> > +
> > +	if (geometry->pack_nr <= 1) {
> > +		geometry->split = geometry->pack_nr;
> > +		return;
> > +	}
>
> When there is a single pack (or no pack), we place the split to 1
> (let's keep reading with the need to find out what split means in
> mind; it is not yet clear if it points at the pack that will be part
> of the kept set, or at the pack that is the last one among the
> repacked set, at this point in the code).

Everything that is strictly less than the split will get repacked, which
upon reading this again means that we'll repack a repository containing
just a single pack again. That's wasteful, so we may in the future want
to adjust this to set the split to 0 regardless of whether we have zero
or one pack here.

> > +	split = geometry->pack_nr - 1;
> > +
> > +	/*
> > +	 * First, count the number of packs (in descending order of size) which
> > +	 * already form a geometric progression.
> > +	 */
> > +	for (i = geometry->pack_nr - 1; i > 0; i--) {
> > +		struct packed_git *ours = geometry->pack[i];
> > +		struct packed_git *prev = geometry->pack[i - 1];
> > +		if (geometry_pack_weight(ours) >= factor * geometry_pack_weight(prev))
> > +			split--;
> > +		else
> > +			break;
> > +	}
>
> Instead of rolling up from smaller ones like explained in the log
> message, we scan from the larger end and see where the existing
> progression is broken.  When the loop breaks in the middle, the pack
> at position 'i-1' (prev) is too big.
>
> Why do we need to initialize 'split' before the loop and decrement
> it?  Wouldn't it be equivalent to assign 'i' after the loop breaks
> to 'split'?

Yep, they are equivalent.

> In any case, after the loop breaks, the packs starting at position
> 'i+1' (one after ours when the loop broke) thru to the end of the
> geometry->pack[] array are in good progression.  We have 'i' in
> 'split' at this point, so ...
>
> > +	if (split) {
> > +		/*
> > +		 * Move the split one to the right, since the top element in the
> > +		 * last-compared pair can't be in the progression. Only do this
> > +		 * when we split in the middle of the array (otherwise if we got
> > +		 * to the end, then the split is in the right place).
> > +		 */
> > +		split++;
> > +	}
>
> ... we increment it.  It means geometry->pack[split] is small enough
> relative to geometry->pack[split+1] and so on thru to the end of the
> array.
>
> What if split==0 when we exited the loop?  That would mean that the
> everything in the array was in good progression, which is in line
> with the "in the middle" case.  Either way, the pack at 'split' and
> later are in good progression.

Right (and ditto that we wouldn't do anything if split==0 in that case).

>  - we know many numbers are in uint32_t because that is how
>    packfiles limit their contents, but is it safe to perform the
>    multiplication with factor and comparison in that type?

We could arguably be more careful here, yes.

> > @@ -396,9 +525,19 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  		strvec_pushf(&cmd.args, "--keep-pack=%s",
> >  			     keep_pack_list.items[i].string);
> >  	strvec_push(&cmd.args, "--non-empty");
> > -	strvec_push(&cmd.args, "--all");
> > -	strvec_push(&cmd.args, "--reflog");
> > -	strvec_push(&cmd.args, "--indexed-objects");
> > +	if (!geometry) {
> > +		/*
> > +		 * 'git pack-objects' will up all objects loose or packed
>
> "git pack-objects --stdin-packs" will?
> What verb is missing in "will VERB up all objects"?

Likely I meant to say "roll" here before "up".

> > +		 * (either rolling them up or leaving them alone), so don't pass
> > +		 * these options.
> > +		 *
> > +		 * The implementation of 'git pack-objects --stdin-packs'
> > +		 * makes them redundant (and the two are incompatible).
>
> I am not sure if that is true.
>
> More importantly, if you read this comment after you are done with
> the series and no longer feel that geometric repacking is the most
> important thing in the world, you'd realize that an important piece
> of information is missing to help readers.  It talks about what
> "geometric" code does (i.e. uses --stdin-packs hence no need to pass
> these options) in a block that is for !geometric.
>
> 	We need to grab all reachable objects, including those that
> 	are reachable from reflogs and the index.
>
> 	When repacking into a geometric progression of packs,
> 	however, we ask 'git pack-objects --stdin-packs', and it is
> 	not about packing objects based on reachability but about
> 	repacking all the objects in specified packs and loose ones
> 	(indeed, --stdin-packs is incompatible with these options).
>
> or something?  I suspect that --stdin-packs does not make --all and
> others "redundant".  The operation is about creating a new pack out
> of the objects contained in these packs, regardless of the objects'
> reachability from the usual "refs, index and reflogs" anchor points,
> no?

Exactly right. And I am certainly in favor of your wording above. Since
this series is already on next, I'd be happy to pick this up with the
few other minor things above in a separate series to apply on top (but
since I don't think any of these are correctness issues, you should feel
free to continue merging this down in the meantime).

> > @@ -507,6 +666,25 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
> >  			if (!string_list_has_string(&names, sha1))
> >  				remove_redundant_pack(packdir, item->string);
> >  		}
> > +
> > +		if (geometry) {
> > +			struct strbuf buf = STRBUF_INIT;
> > +
> > +			uint32_t i;
> > +			for (i = 0; i < geometry->split; i++) {
> > +				struct packed_git *p = geometry->pack[i];
> > +				if (string_list_has_string(&names,
> > +							   hash_to_hex(p->hash)))
> > +					continue;
> > +
> > +				strbuf_reset(&buf);
> > +				strbuf_addstr(&buf, pack_basename(p));
> > +				strbuf_strip_suffix(&buf, ".pack");
> > +
> > +				remove_redundant_pack(packdir, buf.buf);
> > +			}
> > +			strbuf_release(&buf);
> > +		}
>
> Before this new code, we seem to remove all pre-existing packfiles
> that are not in the output from the pack-objects already.  The only
> reason that code does not harm the geometry case is we assume
> get_non_kept_pack_filenames() call is never made while doing
> geometric repack (iow, ALL_INTO_ONE is not set) and the list of
> pre-existing packfiles &existing_packs is empty.  Am I reading the
> code correctly?
>
>  - It is a bit unnerving to learn (and it will be a maintenance
>    burden in the future) that a variable whose name is
>    existing_packs does not necessarily have a list of existing packs
>    depending on the mode we are operating in.
>
>  - The guard to make geometric incompatible with ALL_INTO_ONE does
>    not mention ALL_INTO_ONE, even though that bit is what would
>    corrupt the resulting repository if overlooked.  We should
>    probably need s/pack_everything/& \& ALL_INTO_ONE/ in the hunk
>     below.

Eek, yes. This is because the geometric code takes its own view of the
pack directory when figuring out where to place to split line, and so it
seemed easier to have separate paths.

I'm not sure whether I maintain that that was a good idea in hindsight
;). Certainly it does create a little bit of a maintenance burden for
us. But they really are two different things: the geometric code really
wants to have the packs laid out in order of object size, while the
"existing" string_list wants packs laid out in lexicographic order of
their filename to check whether certain packs exist or not.

> Other than that, it was a fun patch to read.

Thanks, I think the few suggestions you made here are good ones. I'll
put it on my to-do list of things to clean up in a separate little
series.

Since this is already in next, I would suggest continuing to merge it
down since none of these suggestions impact the patch's correctness.

Thanks,
Taylor

  parent reply	other threads:[~2021-03-04 21:58 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-19 23:23 [PATCH 00/10] repack: support repacking into a geometric sequence Taylor Blau
2021-01-19 23:24 ` [PATCH 01/10] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-01-20 13:40   ` Derrick Stolee
2021-01-20 14:38     ` Taylor Blau
2021-01-29  2:33   ` Junio C Hamano
2021-01-29 18:38     ` Taylor Blau
2021-01-29 19:31     ` Jeff King
2021-01-29 20:20       ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 02/10] revision: learn '--no-kept-objects' Taylor Blau
2021-01-29  3:10   ` Junio C Hamano
2021-01-29 19:13     ` Taylor Blau
2021-01-19 23:24 ` [PATCH 03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed' Taylor Blau
2021-01-29  3:21   ` Junio C Hamano
2021-01-29 19:19     ` Jeff King
2021-01-29 20:01       ` Taylor Blau
2021-01-29 20:25         ` Jeff King
2021-01-29 22:10           ` Taylor Blau
2021-01-29 22:57             ` Jeff King
2021-01-29 23:03             ` Junio C Hamano
2021-01-29 23:28               ` Taylor Blau
2021-02-02  3:04                 ` Taylor Blau
2021-01-29 23:31               ` Jeff King
2021-01-29 22:13           ` Junio C Hamano
2021-01-29 20:30       ` Junio C Hamano
2021-01-29 22:43         ` Jeff King
2021-01-29 22:53           ` Taylor Blau
2021-01-29 23:00             ` Jeff King
2021-01-29 23:10             ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 04/10] p5303: add missing &&-chains Taylor Blau
2021-01-19 23:24 ` [PATCH 05/10] p5303: measure time to repack with keep Taylor Blau
2021-01-29  3:40   ` Junio C Hamano
2021-01-29 19:32     ` Jeff King
2021-01-29 20:04       ` [PATCH] p5303: avoid sed GNU-ism Jeff King
2021-01-29 20:19         ` Eric Sunshine
2021-01-29 20:27           ` Jeff King
2021-01-29 20:36             ` Eric Sunshine
2021-01-29 22:11               ` Taylor Blau
2021-01-29 20:38       ` [PATCH 05/10] p5303: measure time to repack with keep Junio C Hamano
2021-01-29 22:10         ` Jeff King
2021-01-29 23:12           ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 06/10] pack-objects: rewrite honor-pack-keep logic Taylor Blau
2021-01-19 23:24 ` [PATCH 07/10] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-01-19 23:24 ` [PATCH 08/10] builtin/pack-objects.c: teach '--keep-pack-stdin' Taylor Blau
2021-01-19 23:24 ` [PATCH 09/10] builtin/repack.c: extract loose object handling Taylor Blau
2021-01-20 13:59   ` Derrick Stolee
2021-01-20 14:34     ` Taylor Blau
2021-01-20 15:51       ` Derrick Stolee
2021-01-21  3:45     ` Junio C Hamano
2021-01-19 23:24 ` [PATCH 10/10] builtin/repack.c: add '--geometric' option Taylor Blau
2021-01-20 14:05 ` [PATCH 00/10] repack: support repacking into a geometric sequence Derrick Stolee
2021-02-04  3:58 ` [PATCH v2 0/8] " Taylor Blau
2021-02-04  3:58   ` [PATCH v2 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-16 21:42     ` Jeff King
2021-02-16 21:48       ` Taylor Blau
2021-02-04  3:58   ` [PATCH v2 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-16 23:17     ` Jeff King
2021-02-17 18:35       ` Taylor Blau
2021-02-04  3:59   ` [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-16 23:46     ` Jeff King
2021-02-17 18:59       ` Taylor Blau
2021-02-17 19:21         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-04  3:59   ` [PATCH v2 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-16 23:58     ` Jeff King
2021-02-17  0:02       ` Jeff King
2021-02-17 19:13       ` Taylor Blau
2021-02-17 19:25         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-17 16:05     ` Jeff King
2021-02-17 19:23       ` Taylor Blau
2021-02-17 19:29         ` Jeff King
2021-02-04  3:59   ` [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-17 17:11     ` Jeff King
2021-02-17 19:54       ` Taylor Blau
2021-02-17 20:25         ` Jeff King
2021-02-17 20:29           ` Taylor Blau
2021-02-17 21:43             ` Jeff King
2021-02-04  3:59   ` [PATCH v2 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-17 18:17     ` Jeff King
2021-02-17 20:01       ` Taylor Blau
2021-02-17  0:01   ` [PATCH v2 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-17 18:18     ` Jeff King
2021-02-18  3:14 ` [PATCH v3 " Taylor Blau
2021-02-18  3:14   ` [PATCH v3 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-18  3:14   ` [PATCH v3 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-18  3:14   ` [PATCH v3 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-18  3:14   ` [PATCH v3 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-18  3:14   ` [PATCH v3 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-18  3:14   ` [PATCH v3 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-18  3:14   ` [PATCH v3 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-18  3:14   ` [PATCH v3 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-23  0:31   ` [PATCH v3 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-23  1:06     ` Taylor Blau
2021-02-23  1:42       ` Jeff King
2021-02-23  2:24 ` [PATCH v4 " Taylor Blau
2021-02-23  2:25   ` [PATCH v4 1/8] packfile: introduce 'find_kept_pack_entry()' Taylor Blau
2021-02-23  2:25   ` [PATCH v4 2/8] revision: learn '--no-kept-objects' Taylor Blau
2021-02-23  2:25   ` [PATCH v4 3/8] builtin/pack-objects.c: add '--stdin-packs' option Taylor Blau
2021-02-23  8:07     ` Junio C Hamano
2021-02-23 18:51       ` Jeff King
2021-02-23  2:25   ` [PATCH v4 4/8] p5303: add missing &&-chains Taylor Blau
2021-02-23  2:25   ` [PATCH v4 5/8] p5303: measure time to repack with keep Taylor Blau
2021-02-23  2:25   ` [PATCH v4 6/8] builtin/pack-objects.c: rewrite honor-pack-keep logic Taylor Blau
2021-02-23  2:25   ` [PATCH v4 7/8] packfile: add kept-pack cache for find_kept_pack_entry() Taylor Blau
2021-02-23  2:25   ` [PATCH v4 8/8] builtin/repack.c: add '--geometric' option Taylor Blau
2021-02-24 23:19     ` Junio C Hamano
2021-02-24 23:43       ` Junio C Hamano
2021-03-04 21:40         ` Taylor Blau
2021-03-04 21:55       ` Taylor Blau [this message]
2021-02-23  3:39   ` [PATCH v4 0/8] repack: support repacking into a geometric sequence Jeff King
2021-02-23  7:43   ` Junio C Hamano
2021-02-23 18:44     ` Jeff King
2021-02-23 19:54       ` Martin Fick
2021-02-23 20:06         ` Taylor Blau
2021-02-23 21:57           ` Martin Fick
2021-02-23 20:15         ` Jeff King
2021-02-23 21:41           ` Martin Fick
2021-02-23 21:53             ` Jeff King
2021-02-24 18:13               ` Martin Fick
2021-02-26  6:23                 ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YEFXRwyMpyXHgArH@nand.local \
    --to=me@ttaylorr.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).