git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Michael Haggerty <mhagger@alum.mit.edu>
Subject: Re: [PATCH v2 7/7] pack-objects: use mru list when iterating over packs
Date: Tue, 9 Aug 2016 13:45:29 -0400	[thread overview]
Message-ID: <20160809174528.2ydgkhd7ayclat3t@sigill.intra.peff.net> (raw)
In-Reply-To: <20160809140411.7745apztp36nwshx@sigill.intra.peff.net>

On Tue, Aug 09, 2016 at 10:04:11AM -0400, Jeff King wrote:

> Here's the code to do the cycle-breaking.
> [...]
> It seems to perform well, and it does break the cycles (the later check
> during the write does not kick in, and we get no warnings). I didn't dig
> into the fates of specific objects, but any cycles should be added to
> the delta-compression phase.

Curious about that last part, I did the repack both with and without
this patch. We should be able to see the difference in the progress
meter from the "Compressing" phase. And indeed, it goes from 3030778
objects to 3031056, an increase of 278 objects. Which just so happens to
be exactly the number of "warning: recursive delta detected" lines that
are printed without the patch.

Not surprising, but a nice confirmation that everything is working as
expected.  But while looking at this, I managed to unravel what's been
going on this whole time.

I ran the repack again with stock git (no MRU patch), and the number of
objects in the delta phase jumped to 3087880, around 56,000 more than
with the MRU patch. So that's probably where the magic "3%" is coming
from.  By looking at the smaller packs first, we are more likely to find
copies of objects which _aren't_ deltas, and then consider them for
delta compression. And that compression tends to find a better match
than reusing what was in the big pack, and we end up with a slightly
smaller output pack.

So why are the deltas we find so much better than what is in the big
pack?

There's something rather subtle going on, and it has to do with the fact
that the existing pack was _not_ made by a stock version of git.

I may have mentioned before that I have some patches which restrict the
delta selection process. The goal is that if you have several "islands"
of refs (in my case, refs corresponding to particular forks of a
repository), it's desirable that you don't make a delta against a base
that is not reachable from all of the same islands. Otherwise, when
somebody fetches or clones the fork, you have to throw away the delta
and find one from scratch. If this happens a lot, it gets expensive.
Unsurprisingly, the packfiles produced with the island-restrictions
turned on are slightly worse than normal packfiles (because we reject
some delta possibilities that might be better).

When I did my testing, I did not enable the delta islands, because I
didn't want them to influence the results. But of course their effect
could still be seen in the deltas in the existing packfile, because it
came off our production servers (because I wanted a real-world test
case). So any pack ordering which ends up reusing _fewer_ of those
deltas is going to be a net win in size, because it's going to find new
deltas without being hampered by the island restrictions. And that's
exactly what happened in my tests; the stock pack ordering finds more
objects in the newer packs, has fewer deltas to reuse, and is able to
find better unrestricted deltas.

Unfortunately I don't have a real-world test case that's been packed up
until now with stock git. So the next-best thing I can do is re-run my
tests with the island restrictions turned on, so that newly-found deltas
do not have an unfair advantage. And here are the sizes for each case
after doing so:

  7848512375 before.git/objects/pack/pack-2f5878f6569cd6682e7084318de311ed26d19af5.pack
  7851778788 mru.git/objects/pack/pack-7694766c660eaa0e27e4e51a77bd5c457c3d2f1d.pack
  7851165480 cycle-break.git/objects/pack/pack-cfaceb3423993c72e4ac588c30c94da2d087145a.pack

The MRU case _is_ slightly bigger, but only by 0.04%. And then adding
the cycle-breaking on top of that reclaims a little bit of space,
because rather than throwing out those 287 deltas at the write phase,
it's able to add those objects back to the delta search and find new
deltas for them.

So there. Mystery solved, and I feel confident in saying that the MRU
patches _don't_ represent a real regression in pack size. We do need to
deal with the cycles, of course, but it's less an issue of size and more
one of correctness (well, the pack is correct either way, but still, it
seems more correct to realize during the correct phase that we cannot
reuse those deltas).

I also suspect that one of the tricks I tried, simply reversing the pack
order (so the big pack is at the front, but the order is stable) would
work very well for this case. But as I mentioned before, I prefer the
MRU strategy because it's less susceptible to pathological cases.

-Peff

  reply	other threads:[~2016-08-09 17:45 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-29  4:04 [PATCH v2 0/7] speed up pack-objects counting with many packs Jeff King
2016-07-29  4:06 ` [PATCH v2 1/7] t/perf: add tests for many-pack scenarios Jeff King
2016-07-29  4:06 ` [PATCH v2 2/7] sha1_file: drop free_pack_by_name Jeff King
2016-07-29  4:06 ` [PATCH v2 3/7] add generic most-recently-used list Jeff King
2016-07-29  4:09 ` [PATCH v2 4/7] find_pack_entry: replace last_found_pack with MRU cache Jeff King
2016-07-29  4:10 ` [PATCH v2 5/7] pack-objects: break out of want_object loop early Jeff King
2016-07-29  4:11 ` [PATCH v2 6/7] pack-objects: compute local/ignore_pack_keep early Jeff King
2016-07-29  4:15 ` [PATCH v2 7/7] pack-objects: use mru list when iterating over packs Jeff King
2016-07-29  5:45   ` Jeff King
2016-07-29 15:02     ` Junio C Hamano
2016-08-08 14:50       ` Jeff King
2016-08-08 16:28         ` Junio C Hamano
2016-08-08 16:51           ` Jeff King
2016-08-08 17:16             ` Junio C Hamano
2016-08-09 14:04               ` Jeff King
2016-08-09 17:45                 ` Jeff King [this message]
2016-08-09 18:06                   ` Junio C Hamano
2016-08-09 22:29                 ` Junio C Hamano
2016-08-10 11:52                   ` [PATCH v3 0/2] pack-objects mru Jeff King
2016-08-10 12:02                     ` [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase Jeff King
2016-08-10 20:17                       ` Junio C Hamano
2016-08-11  5:02                         ` Jeff King
2016-08-11  5:15                           ` [PATCH v4 " Jeff King
2016-08-11  6:57                           ` [PATCH v3 " Jeff King
2016-08-11  9:20                             ` [PATCH v5] pack-objects mru Jeff King
2016-08-11  9:24                               ` [PATCH v5 1/4] provide an initializer for "struct object_info" Jeff King
2016-08-11  9:25                               ` [PATCH v5 2/4] sha1_file: make packed_object_info public Jeff King
2016-08-11  9:26                               ` [PATCH v5 3/4] pack-objects: break delta cycles before delta-search phase Jeff King
2016-08-11  9:26                               ` [PATCH v5 4/4] pack-objects: use mru list when iterating over packs Jeff King
2016-08-11  9:57                               ` [PATCH v5] pack-objects mru Jeff King
2016-08-11 15:11                                 ` Junio C Hamano
2016-08-11 16:19                                   ` Jeff King
2016-08-10 12:03                     ` [PATCH v3 2/2] pack-objects: use mru list when iterating over packs Jeff King
2016-08-10 16:47                     ` [PATCH v3 0/2] pack-objects mru Junio C Hamano
2016-08-11  4:48                       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160809174528.2ydgkhd7ayclat3t@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).