git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, Michael Haggerty <mhagger@alum.mit.edu>
Subject: Re: [PATCH v2 7/7] pack-objects: use mru list when iterating over packs
Date: Tue, 9 Aug 2016 10:04:12 -0400	[thread overview]
Message-ID: <20160809140411.7745apztp36nwshx@sigill.intra.peff.net> (raw)
In-Reply-To: <xmqqzionfafj.fsf@gitster.mtv.corp.google.com>

On Mon, Aug 08, 2016 at 10:16:32AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> It worries me a lot to lose the warning unconditionally, though.
> >> That's the (only) coal-mine canary that lets us notice a problem
> >> when we actually start hitting that last-ditch cycle breaking too
> >> often.
> >
> > The dedicated cycle-detection will lose that warning, too (it doesn't
> > touch that code, but it's effectively checking the same thing earlier).
> >
> > I agree it's unfortunate to lose. On the other hand, it is being lost
> > because we are correctly handling the cycles, and there is nothing to
> > warn about. So it ceases to be a problem, and starts being a normal,
> > acceptable thing.
> 
> That unfortunately is beside the point.  The existing cycle breaking
> was meant to be a last ditch effort to avoid producing a broken pack
> (i.e. a suboptimal pack without cycle is better than unusable pack
> with delta cycle), while letting us know that we found a case where
> the remainder of the pack building machinery does not function well
> without it (so that we know we broke something when we tweaked the
> machinery without intending to break it).  Squelching the warnings
> feels similar to "we see too many valgrind warnings, so let's stop
> running valgrind"; I was hoping there would be a solution more like
> "instead of not running, let's teach valgrind this and that codepath
> is OK".

I don't think there is a way to do "this code path is OK", exactly.
Though by putting cycle-breaking at the check_object() stage, the
warning at the write_object() stage would continue to ensure that we
don't introduce any new cycles with our delta search. So that does have
some value.

Here's the code to do the cycle-breaking. Aside from the "hacky" bit,
it's quite simple.  I added a new state enum to object_entry to handle
the graph traversal. Since it only needs 2 bits, I _assume_ a compiler
can fit it in with the bitfields above (or at the very least give it its
own single byte so we just use what would otherwise be struct padding).
But I didn't check; if it turns out not to be the case we can easily
emulate it with two bitfields.  The write_object() check abuses the
"idx.offset" field to keep the same state, but we could convert it to
use these flags if we care.

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index c91d54a..07b6fea 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1499,6 +1499,67 @@ static int pack_offset_sort(const void *_a, const void *_b)
 			(a->in_pack_offset > b->in_pack_offset);
 }
 
+/*
+ * Follow the chain of deltas from this entry onward, throwing away any links
+ * that cause us to hit a cycle (as determined by the DFS state flags in
+ * the entries).
+ */
+static void break_delta_cycles(struct object_entry *entry)
+{
+	/* If it's not a delta, it can't be part of a cycle. */
+	if (!entry->delta) {
+		entry->dfs_state = DFS_DONE;
+		return;
+	}
+
+	switch (entry->dfs_state) {
+	case DFS_NONE:
+		/*
+		 * This is the first time we've seen the object. We become part
+		 * of the active potential cycle and recurse.
+		 */
+		entry->dfs_state = DFS_ACTIVE;
+		break_delta_cycles(entry->delta);
+		entry->dfs_state = DFS_DONE;
+		break;
+
+	case DFS_DONE:
+		/* object already examined, and not part of a cycle */
+		break;
+
+	case DFS_ACTIVE:
+		/*
+		 * We found a cycle that needs broken. We have to not only
+		 * drop our entry->delta link, but we need to remove
+		 * ourselves from the delta_sibling chain of our base.
+		 */
+		{
+			struct object_entry **p = &entry->delta->delta_child;
+			while (*p) {
+				if (*p == entry)
+					*p = (*p)->delta_sibling;
+				else
+					p = &(*p)->delta_sibling;
+			}
+		}
+		entry->delta = NULL;
+
+		/*
+		 * XXX This is hacky. We need to figure out our real size (not
+		 * the delta size). check_object() already does this, so let's
+		 * just re-run it, but telling it not to reuse any deltas. This
+		 * probably should just be a single function to track down the
+		 * size from the delta (or even just sha1_object_info(),
+		 * though that is a little less efficient because we already
+		 * know which pack we're in).
+		 */
+		reuse_delta = 0;
+		check_object(entry);
+		reuse_delta = 1;
+		break;
+	}
+}
+
 static void get_object_details(void)
 {
 	uint32_t i;
@@ -1516,6 +1577,13 @@ static void get_object_details(void)
 			entry->no_try_delta = 1;
 	}
 
+	/*
+	 * This must happen in a second pass, since we rely on the delta
+	 * information for the whole list being completed.
+	 */
+	for (i = 0; i < to_pack.nr_objects; i++)
+		break_delta_cycles(&to_pack.objects[i]);
+
 	free(sorted_by_offset);
 }
 
diff --git a/pack-objects.h b/pack-objects.h
index d1b98b3..cc9b9a9 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -27,6 +27,15 @@ struct object_entry {
 	unsigned no_try_delta:1;
 	unsigned tagged:1; /* near the very tip of refs */
 	unsigned filled:1; /* assigned write-order */
+
+	/*
+	 * State flags for depth-first search used for analyzing delta cycles.
+	 */
+	enum {
+		DFS_NONE = 0,
+		DFS_ACTIVE,
+		DFS_DONE
+	} dfs_state;
 };
 
 struct packing_data {


It seems to perform well, and it does break the cycles (the later check
during the write does not kick in, and we get no warnings). I didn't dig
into the fates of specific objects, but any cycles should be added to
the delta-compression phase.

The resulting pack size is almost exactly what it was with hitting the
write_object() check. So that means all of my testing was really not
that interesting, because the extra space isn't coming from the delta
cycles at all. It's simply an artifact of the different set of objects
we happen to find (just as reversing the pack list produces a different
order, with a different size).

So it happens to be a 3% loss in this case. I'm not convinced it would
not actually be a win in some other cases. But more importantly, in a
repository with fewer packs, it would likely be a much smaller
difference (in either direction), just because there's less play in
where we find a given object.

I think my preference is to clean up the "hacky" bit of this patch, and
then apply the earlier MRU patch on top of it (which takes my repack
from 44 minutes to 5 minutes for this particular test set). I see you
graduated the earlier bits of the series to "master" already.

-Peff

  reply	other threads:[~2016-08-09 14:04 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-29  4:04 [PATCH v2 0/7] speed up pack-objects counting with many packs Jeff King
2016-07-29  4:06 ` [PATCH v2 1/7] t/perf: add tests for many-pack scenarios Jeff King
2016-07-29  4:06 ` [PATCH v2 2/7] sha1_file: drop free_pack_by_name Jeff King
2016-07-29  4:06 ` [PATCH v2 3/7] add generic most-recently-used list Jeff King
2016-07-29  4:09 ` [PATCH v2 4/7] find_pack_entry: replace last_found_pack with MRU cache Jeff King
2016-07-29  4:10 ` [PATCH v2 5/7] pack-objects: break out of want_object loop early Jeff King
2016-07-29  4:11 ` [PATCH v2 6/7] pack-objects: compute local/ignore_pack_keep early Jeff King
2016-07-29  4:15 ` [PATCH v2 7/7] pack-objects: use mru list when iterating over packs Jeff King
2016-07-29  5:45   ` Jeff King
2016-07-29 15:02     ` Junio C Hamano
2016-08-08 14:50       ` Jeff King
2016-08-08 16:28         ` Junio C Hamano
2016-08-08 16:51           ` Jeff King
2016-08-08 17:16             ` Junio C Hamano
2016-08-09 14:04               ` Jeff King [this message]
2016-08-09 17:45                 ` Jeff King
2016-08-09 18:06                   ` Junio C Hamano
2016-08-09 22:29                 ` Junio C Hamano
2016-08-10 11:52                   ` [PATCH v3 0/2] pack-objects mru Jeff King
2016-08-10 12:02                     ` [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase Jeff King
2016-08-10 20:17                       ` Junio C Hamano
2016-08-11  5:02                         ` Jeff King
2016-08-11  5:15                           ` [PATCH v4 " Jeff King
2016-08-11  6:57                           ` [PATCH v3 " Jeff King
2016-08-11  9:20                             ` [PATCH v5] pack-objects mru Jeff King
2016-08-11  9:24                               ` [PATCH v5 1/4] provide an initializer for "struct object_info" Jeff King
2016-08-11  9:25                               ` [PATCH v5 2/4] sha1_file: make packed_object_info public Jeff King
2016-08-11  9:26                               ` [PATCH v5 3/4] pack-objects: break delta cycles before delta-search phase Jeff King
2016-08-11  9:26                               ` [PATCH v5 4/4] pack-objects: use mru list when iterating over packs Jeff King
2016-08-11  9:57                               ` [PATCH v5] pack-objects mru Jeff King
2016-08-11 15:11                                 ` Junio C Hamano
2016-08-11 16:19                                   ` Jeff King
2016-08-10 12:03                     ` [PATCH v3 2/2] pack-objects: use mru list when iterating over packs Jeff King
2016-08-10 16:47                     ` [PATCH v3 0/2] pack-objects mru Junio C Hamano
2016-08-11  4:48                       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160809140411.7745apztp36nwshx@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mhagger@alum.mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).