From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org, Michael Haggerty <mhagger@alum.mit.edu>
Subject: Re: [PATCH v3 1/2] pack-objects: break delta cycles before delta-search phase
Date: Wed, 10 Aug 2016 13:17:22 -0700
Message-ID: <xmqqr39w4bvx.fsf@gitster.mtv.corp.google.com>
In-Reply-To: <20160810120248.i2hvm2q6ag3rvsk4@sigill.intra.peff.net> (Jeff King's message of "Wed, 10 Aug 2016 08:02:49 -0400")
Jeff King <peff@peff.net> writes:
> ...
> We could do analysis on any cycles that we find to
> distinguish the two cases (i.e., it is a bogus pack if and
> only if every delta in the cycle is in the same pack), but
> we don't need to. If there is a cycle inside a pack, we'll
> run into problems not only reusing the delta, but accessing
> the object data at all. So when we try to dig up the actual
> size of the object, we'll hit that same cycle and kick in
> our usual complain-and-try-another-source code.
I agree with all of the above reasoning.
> Actually, skimming the sha1_file code, I am not 100% sure that we detect
> cycles in OBJ_REF_DELTA (you cannot have cycles in OBJ_OFS_DELTA since
> they always point backwards in the pack). But if that is the case, then
> I think we should fix that, not worry about special-casing it here.
Yes, but sha1_file.c? It is the reading side and it is too late if
we notice a problem, I would think.
> +/*
> + * Drop an on-disk delta we were planning to reuse. Naively, this would
> + * just involve blanking out the "delta" field, but we have to deal
> + * with two extra pieces of book-keeping:
> + *
> + * 1. Removing ourselves from the delta_sibling linked list.
> + *
> + * 2. Updating our size; check_object() will have filled in the size of our
> + * delta, but a non-delta object needs its true size.
Excellent point.
> +/*
> + * Follow the chain of deltas from this entry onward, throwing away any links
> + * that cause us to hit a cycle (as determined by the DFS state flags in
> + * the entries).
> + */
> +static void break_delta_cycles(struct object_entry *entry)
> +{
> +	/* If it's not a delta, it can't be part of a cycle. */
> +	if (!entry->delta) {
> +		entry->dfs_state = DFS_DONE;
> +		return;
> +	}
> +
> +	switch (entry->dfs_state) {
> +	case DFS_NONE:
> +		/*
> +		 * This is the first time we've seen the object. We mark it as
> +		 * part of the active potential cycle and recurse.
> +		 */
> +		entry->dfs_state = DFS_ACTIVE;
> +		break_delta_cycles(entry->delta);
> +		entry->dfs_state = DFS_DONE;
> +		break;
> +
> +	case DFS_DONE:
> +		/* object already examined, and not part of a cycle */
> +		break;
> +
> +	case DFS_ACTIVE:
> +		/*
> +		 * We found a cycle that needs broken. It would be correct to
> +		 * break any link in the chain, but it's convenient to
> +		 * break this one.
> +		 */
> +		drop_reused_delta(entry);
> +		break;
> +	}
> +}
Do we need to do anything to the DFS state of an entry when
drop_reused_delta() resets its other fields? If we had this and
started from A (read "A--->B" as "A is based on B"):
    A--->B--->C--->A
we paint A as ACTIVE, visit B and then C and paint them as active,
and when we visit A for the second time, we drop it (i.e. break the
link between A and B), return and paint C as DONE, return and paint
B as DONE, and leave A painted as ACTIVE, while the chain is now
    B--->C--->A
If we later find D that is directly based on A, wouldn't we end up
visiting A and attempt to drop it again? drop_reused_delta() is
idempotent so there will be no data structure corruption, I think,
but we can safely declare that the entry is now DONE after calling
drop_reused_delta() on it (either in the function or in the caller
after it calls the function), no?
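A minimal sketch of that suggestion, for illustration only. The struct below is a cut-down stand-in for the real "struct object_entry", the "drops" counter and demo_drop_count() are hypothetical helpers, and this drop_reused_delta() only clears the delta pointer (the real one also has to fix up delta_sibling and the size). The single change relative to the quoted function is the assignment after drop_reused_delta() in the DFS_ACTIVE case.

```c
#include <assert.h>
#include <stddef.h>

enum dfs_state { DFS_NONE = 0, DFS_ACTIVE, DFS_DONE };

/* Cut-down stand-in for pack-objects' struct object_entry. */
struct object_entry {
	struct object_entry *delta;	/* base we are stored against, or NULL */
	enum dfs_state dfs_state;
	int drops;			/* hypothetical: times we were dropped */
};

/* Stand-in: only clears the link and counts the call. */
static void drop_reused_delta(struct object_entry *entry)
{
	entry->delta = NULL;
	entry->drops++;
}

static void break_delta_cycles(struct object_entry *entry)
{
	/* If it's not a delta, it can't be part of a cycle. */
	if (!entry->delta) {
		entry->dfs_state = DFS_DONE;
		return;
	}

	switch (entry->dfs_state) {
	case DFS_NONE:
		entry->dfs_state = DFS_ACTIVE;
		break_delta_cycles(entry->delta);
		entry->dfs_state = DFS_DONE;
		break;

	case DFS_DONE:
		break;

	case DFS_ACTIVE:
		drop_reused_delta(entry);
		/* Suggested addition: the entry is settled once dropped. */
		entry->dfs_state = DFS_DONE;
		break;
	}
}

/* Build A--->B--->C--->A plus D--->A, then walk from A and from D. */
int demo_drop_count(void)
{
	struct object_entry a = {0}, b = {0}, c = {0}, d = {0};

	a.delta = &b;
	b.delta = &c;
	c.delta = &a;
	d.delta = &a;

	break_delta_cycles(&a);
	break_delta_cycles(&d);

	/* Every entry ends up DONE and the remaining chain is acyclic. */
	assert(a.dfs_state == DFS_DONE && d.dfs_state == DFS_DONE);
	assert(a.delta == NULL && b.delta == &c && c.delta == &a);
	return a.drops;
}
```

Calling demo_drop_count() from a test main() returns 1 in this sketch: A is dropped exactly once, and the later walk from D finds it already settled.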
> + 2. Picking the next pack to examine based on locality (i.e., where we found
> + something else recently).
> +
> + In this case, we want to make sure that we find the delta versions of A and
> + B and not their base versions. We can do this by putting two blobs in each
> + pack. The first is a "dummy" blob that can only be found in the pack in
> + question. And then the second is the actual delta we want to find.
> +
> + The two blobs must be present in the same tree, not present in other trees,
> + and the dummy pathname must sort before the delta path.
> +# Create a pack containing the tree $1 and blob $1:file, with
> +# the latter stored as a delta against $2:file.
> +#
> +# We convince pack-objects to make the delta in the direction of our choosing
> +# by marking $2 as a preferred-base edge. That results in $1:file as a thin
> +# delta, and index-pack completes it by adding $2:file as a base.
Tricky but clever and correct ;-)
> +make_pack () {
> +	{
> +		echo "-$(git rev-parse $2)"
Is everybody's 'echo' happy with dash followed by unknown string?
> +		echo "$(git rev-parse $1:dummy) dummy"
> +		echo "$(git rev-parse $1:file) file"
> +	} |
> +	git pack-objects --stdout |
> +	git index-pack --stdin --fix-thin
An alternative
	git pack-objects --stdout <<-EOF |
	-$(git rev-parse $2)
	$(git rev-parse $1:dummy) dummy
	$(git rev-parse $1:file) file
	EOF
	git index-pack --stdin --fix-thin
looks somewhat ugly, though.
> +}