git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Cc: Michael Haggerty <mhagger@alum.mit.edu>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH v2 7/7] pack-objects: use mru list when iterating over packs
Date: Fri, 29 Jul 2016 01:45:37 -0400	[thread overview]
Message-ID: <20160729054536.GA27343@sigill.intra.peff.net> (raw)
In-Reply-To: <20160729041524.GG22408@sigill.intra.peff.net>

On Fri, Jul 29, 2016 at 12:15:24AM -0400, Jeff King wrote:

> Note that this reordering does have a potential impact on
> the final pack, as we store only a single "found" pack for
> each object, even if it is present in multiple packs. In
> principle, any copy is acceptable, as they all refer to the
> same content. But in practice, they may differ in whether
> they are stored as deltas, against which base, etc. This may
> have an impact on delta reuse, and even the delta search
> (since we skip pairs that were already in the same pack).
> 
> It's not clear whether this change of order would hurt or
> even help average cases, though. The most likely reason to
> have duplicate objects is from the completion of thin packs
> (e.g., you have some objects, then receive several pushes;
> the packs you receive may be thin on the wire, with deltas
> that refer to bases outside the pack, but we complete them
> with duplicate base objects when indexing them).
> 
> In such a case the current code would always find the thin
> duplicates (because we currently walk the packs in reverse
> chronological order). With this patch, it's possible that
> some of them would be found in older packs instead. But
> again, it's unclear whether that is a net win or loss in
> practice.

Hmm, so the efficacy of packing aside, this does definitely have a
negative effect.

I happened to have a repository sitting around that has 15 million
objects and 3600 packs (don't ask), so this seemed like a good test.

With this patch series, it took 11 minutes to do the counting, delta
compression, and write phases. Without it, after 11 minutes git had not
even gotten 10% of the way through counting. So that's good news.

The not-so-good news is that during the write phase, it hit the
"recursive delta detected" warning in write_one(), many times.

I think what is happening is that in the original code, we cannot ever
see a delta cycle, because the pack ordering is fixed. So if `A` is a
delta of `B`, then we know that they must both exist in the same pack
(since we do not do cross-pack deltas on disk). And because we look at
the packs in the same order for each object, we know that if we are
going to find `A`, we must either find `B` in the same pack (or a prior
one if there is another duplicate). But if we do so, then we cannot
also find `B` as a delta of `A` in that pack (because we know that packs
do not have delta cycles themselves) or an earlier pack (because if so,
we would have found `A` in that pack, too).

But because this series switches the order of pack-lookup between
objects, it is possible for us to find a `B` which is a delta against
`A` in one pack, and then another copy of `A` which is a delta against
another copy of `B` from another pack. We add both of the deltas to our
packing list, but at write time when we try to write out all of the
bases for `A`, we realize that whoops, we are recursing infinitely.

As it turns out, Git actually handles this pretty well! Upon noticing
the recursion, it breaks the delta chain and writes out one of the
objects as a full base. This is due to Junio's f63c79d (pack-object:
tolerate broken packs that have duplicated objects, 2011-11-16), though
I think we later decided that duplicated objects were simply insane.

So one option is to simply silence the warning, because the resulting
pack is perfectly fine. But we do notice this during the write phase,
after the delta search is done. So it's possible that the resulting pack
is not as small as it could be (i.e., we break the chain with a base
object, but it's possible if we looked that we could have broken the
chain by making a delta against an existing base object). So I wonder if
it's possible to detect this case earlier, during the "can we reuse this
delta" bits of check_object().

Suggestions welcome. I haven't really dug past what I've written here,
and it's way too late here to go any further tonight.

-Peff

  reply	other threads:[~2016-07-29  5:45 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-29  4:04 [PATCH v2 0/7] speed up pack-objects counting with many packs Jeff King
2016-07-29  4:06 ` [PATCH v2 1/7] t/perf: add tests for many-pack scenarios Jeff King
2016-07-29  4:06 ` [PATCH v2 2/7] sha1_file: drop free_pack_by_name Jeff King
2016-07-29  4:06 ` [PATCH v2 3/7] add generic most-recently-used list Jeff King
2016-07-29  4:09 ` [PATCH v2 4/7] find_pack_entry: replace last_found_pack with MRU cache Jeff King
2016-07-29  4:10 ` [PATCH v2 5/7] pack-objects: break out of want_object loop early Jeff King
2016-07-29  4:11 ` [PATCH v2 6/7] pack-objects: compute local/ignore_pack_keep early Jeff King
2016-07-29  4:15 ` [PATCH v2 7/7] pack-objects: use mru list when iterating over packs Jeff King
2016-07-29  5:45   ` Jeff King [this message]
2016-07-29 15:02     ` Junio C Hamano
2016-08-08 14:50       ` Jeff King
2016-08-08 16:28         ` Junio C Hamano
2016-08-08 16:51           ` Jeff King
2016-08-08 17:16             ` Junio C Hamano
2016-08-09 14:04               ` Jeff King
2016-08-09 17:45                 ` Jeff King
2016-08-09 18:06                   ` Junio C Hamano
2016-08-09 22:29                 ` Junio C Hamano
2016-08-10 11:52                   ` [PATCH v3 0/2] pack-objects mru Jeff King
2016-08-10 12:02                     ` [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase Jeff King
2016-08-10 20:17                       ` Junio C Hamano
2016-08-11  5:02                         ` Jeff King
2016-08-11  5:15                           ` [PATCH v4 " Jeff King
2016-08-11  6:57                           ` [PATCH v3 " Jeff King
2016-08-11  9:20                             ` [PATCH v5] pack-objects mru Jeff King
2016-08-11  9:24                               ` [PATCH v5 1/4] provide an initializer for "struct object_info" Jeff King
2016-08-11  9:25                               ` [PATCH v5 2/4] sha1_file: make packed_object_info public Jeff King
2016-08-11  9:26                               ` [PATCH v5 3/4] pack-objects: break delta cycles before delta-search phase Jeff King
2016-08-11  9:26                               ` [PATCH v5 4/4] pack-objects: use mru list when iterating over packs Jeff King
2016-08-11  9:57                               ` [PATCH v5] pack-objects mru Jeff King
2016-08-11 15:11                                 ` Junio C Hamano
2016-08-11 16:19                                   ` Jeff King
2016-08-10 12:03                     ` [PATCH v3 2/2] pack-objects: use mru list when iterating over packs Jeff King
2016-08-10 16:47                     ` [PATCH v3 0/2] pack-objects mru Junio C Hamano
2016-08-11  4:48                       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160729054536.GA27343@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).