From: Victoria Dye <vdye@github.com>
To: Junio C Hamano <gitster@pobox.com>,
Victoria Dye via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, derrickstolee@github.com,
shaoxuan.yuan02@gmail.com, newren@gmail.com
Subject: Re: [PATCH 4/4] unpack-trees: handle missing sparse directories
Date: Fri, 5 Aug 2022 09:36:47 -0700 [thread overview]
Message-ID: <3825ef9a-4c71-21ed-6452-bbd322ca839c@github.com> (raw)
In-Reply-To: <xmqqa68j1tlr.fsf@gitster.g>
Junio C Hamano wrote:
> "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Victoria Dye <vdye@github.com>
>>
>> If a sparse directory does not exist in the index, unpack it at the
>> directory level rather than recursing into it an unpacking its contents
>> file-by-file. This helps keep the sparse index as collapsed as possible in
>> cases such as 'git reset --hard' restoring a sparse directory.
>
> My reading hiccuped at "does not exist in". After reading the above
> twice, I think the situation is that there is a directory A/B in
> which a file C sits, and A/ is marked as sparse and the paragraph is
> talking about directory A/B. Because the index has A/ as a tree in
> index, A/B does not exist in the index.
> > When we _somehow_ need to ensure that A/B exists in the index for
> _unknown_ reason, we could flatten the index fully and have A/B/C as
> a blob (without A or A/B in the index proper, even though they may
> appear in cache-tree), but if we stop at removing A and adding A/B
> still as a tree without showing A/B/C in the index, we may gain
> efficiency, under the assumption that we do not have to access A/B/C
> and its siblings.
>
> Did I read the intention correctly? I suspect future readers of the
> commit that would result from this patch would also wonder what the
> "somehow" and "unknown" above are.
>
If I'm reading this correctly, it's not quite what I meant - the situation
this patch addresses is when _nothing_ in the tree rooted at 'A/' (files,
subdirectories) exists in the index, but 'unpack_trees()' is merging tree(s)
where 'A/' *does* exist. I originally wanted to write "If a sparse directory
*is deleted from* the index", but that doesn't cover the case where 'A/'
never existed in the index and we're merging in tree(s) that introduce it.
Maybe it would be clearer to describe it with a different perspective: "If
'unpack_callback()' is merging a new tree into a sparse index, merge the
tree as a sparse directory rather than traversing its contents" or something
like that? I'll try to come up with a better way of explaining this and
update in V2.
>> A directory is determined to be truly non-existent in the index (rather than
>> the parent of existing index entries), if 1) its path is outside the sparse
>> cone and 2) there are no children of the directory in the index. This check
>> is performed by 'missing_dir_is_sparse()' in 'unpack_single_entry()'. If the
>> directory is a missing sparse dir, 'unpack_single_entry()' will proceed
>> with unpacking it. This determination is also propagated back up to
>> 'unpack_callback()' via 'is_missing_sparse_dir' to prevent further tree
>> traversal into the unpacked directory.
>
> Makes sense.
>
>> diff --git a/unpack-trees.c b/unpack-trees.c
>> index 8a454e03bff..aa62cef20fe 100644
>> --- a/unpack-trees.c
>> +++ b/unpack-trees.c
>> @@ -1069,6 +1069,53 @@ static struct cache_entry *create_ce_entry(const struct traverse_info *info,
>> return ce;
>> }
>>
>> +/*
>> + * Determine whether the path specified corresponds to a sparse directory
>> + * completely missing from the index. This function is assumed to only be
>> + * called when the named path isn't already in the index.
>> + */
>> +static int missing_dir_is_sparse(const struct traverse_info *info,
>> + const struct name_entry *p)
>> +{
>> + int res, pos;
>> + struct strbuf dirpath = STRBUF_INIT;
>> + struct unpack_trees_options *o = info->data;
>> +
>> + /*
>> + * First, check whether the path is in the sparse cone. If it is,
>> + * then this directory shouldn't be sparse.
>> + */
>> + strbuf_add(&dirpath, info->traverse_path, info->pathlen);
>> + strbuf_add(&dirpath, p->path, p->pathlen);
>> + strbuf_addch(&dirpath, '/');
>> + if (path_in_cone_mode_sparse_checkout(dirpath.buf, o->src_index)) {
>> + res = 0;
>> + goto cleanup;
>> + }
>
> OK.
>
>> + /*
>> + * Given that the directory is not inside the sparse cone, it could be
>> + * (partially) expanded in the index. If child entries exist, the path
>> + * is not a missing sparse directory.
>> + */
>> + pos = index_name_pos_sparse(o->src_index, dirpath.buf, dirpath.len);
>> + if (pos >= 0)
>> + BUG("cache entry '%s%s' shouldn't exist in the index",
>> + info->traverse_path, p->path);
>
> So, we fed 'p' to this function, and we didn't expect it to exist in
> the index if it is outside the sparse cone.
>
Correct; if an index entry with the name 'p' existed, 'src[0]' would not be
NULL in 'unpack_callback()'/'unpack_single_entry()' and this function
wouldn't have been called.
>> + pos = -pos - 1;
>
> This is the location that a cache entry for the dirpath.buf would
> sit in the index if it were a sparse entry.
>
>> + if (pos >= o->src_index->cache_nr) {
>> + res = 1;
>> + goto cleanup;
>> + }
>> + res = strncmp(o->src_index->cache[pos]->name, dirpath.buf, dirpath.len);
>
> So, we found where dirpath.buf would be inserted. If we (can) look
> at the cache entry that is currently sitting at that position, and
> find that it is a path inside it, then res becomes zero. IOW, we
> found that the sparse directory is missing in the index, but there
> is a path that is in the directory recorded in the index.
>
> The current entry, on the other hand, is outside the dirpath.buf,
> then res becomes non-zero. We are asked to yield true (i.e.
> nonzero) if and only if the given directory and all paths in that
> directory are missing from the index, so that is what happens here.
> Sounds OK.
>
> And the "out of bounds" check just means that the entries that sit
> near the end of the index sort strictly before our dirpath.buf, so
> they cannot be inside our directory, which is also the case where we
> are expected to yield "true".
>
> OK.
>
>> +
>> +cleanup:
>> + strbuf_release(&dirpath);
>> + return res;
>> +}
>> +
>> /*
>> * Note that traverse_by_cache_tree() duplicates some logic in this function
>> * without actually calling it. If you change the logic here you may need to
>> @@ -1078,21 +1125,40 @@ static int unpack_single_entry(int n, unsigned long mask,
>> unsigned long dirmask,
>> struct cache_entry **src,
>> const struct name_entry *names,
>> - const struct traverse_info *info)
>> + const struct traverse_info *info,
>> + int *is_missing_sparse_dir)
>> {
>> int i;
>> struct unpack_trees_options *o = info->data;
>> unsigned long conflicts = info->df_conflicts | dirmask;
>> + const struct name_entry *p = names;
>>
>> - if (mask == dirmask && !src[0])
>> - return 0;
>> + *is_missing_sparse_dir = 0;
>> + if (mask == dirmask && !src[0]) {
>> + /*
>> + * If the directory is completely missing from the index but
>> + * would otherwise be a sparse directory, we should unpack it.
>> + * If not, we'll return and continue recursively traversing the
>> + * tree.
>> + */
>> + if (!o->src_index->sparse_index)
>> + return 0;
>> +
>> + /* Find first entry with a real name (we could use "mask" too) */
>> + while (!p->mode)
>> + p++;
>> +
>> + *is_missing_sparse_dir = missing_dir_is_sparse(info, p);
>> + if (!*is_missing_sparse_dir)
>> + return 0;
>
> Interesting. Nobody checked if the name in 'p' is a directory up to
> this point, but missing_dir_is_sparse() does not care? The only
> thing we know is that !src[0], so the name represented by 'p' does
> not exist in the index. IIRC, the function only checked if p names
> a path inside or outside the sparse cone and if there are or are not
> paths that would be inside the path if p _were_ a directory but I
> didn't read it carefully what it did when the caller fed a path to a
> non-directory to the function. As long as it will say "no" to such
> a situation, I won't complain, but I have to wonder how the logic
> around here tells apart a case where a sparse directory is
> completely hidden (which missing_dir_is_sparse() gives us) from a
> case where there is absolutely nothing, not a directory and not a
> blob, at the path in the index. Or perhaps it works correctly
> without having to tell them apart? I dunno.
>
I wrote 'missing_dir_is_sparse()' in an attempt keep some complex logic out
of the already-complicated 'unpack_single_entry()', but as part of that it
relies on information already established by its caller.
We know 'p' is a directory because 'missing_dir_is_sparse()' is called
inside a 'mask == dirmask' condition. 'mask' is a representation of which
trees being traversed have an entry with the given name and 'dirmask' is a
representation of which of those entries are directories, so the only way
'mask == dirmask' and 'p' is *not* a directory is if the currently-traversed
entries in all of the trees do not exist. *That* won't happen because
'unpack_callback()' won't be invoked at all by 'traverse_trees()' if 'mask'
is 0.
Given that it requires jumping through multiple function invocations and
callbacks to figure that out, I can add some assertions or 'return 0'
conditions at the beginning of 'missing_dir_is_sparse()' to codify its
assumptions. Even if the assertions are slightly redundant now, they'll make
the code clearer and make the function safer for reuse.
> By the way, there is a comment before the function that reminds us
> that traverse_by_cache_tree() may need a matching change whenever a
> change is made to this function. Does this require such a matching
> change?
In this case, I *think* a parallel change there is unnecessary.
'traverse_by_cache_tree()' iterates on the contents of the index, so its
equivalent of 'src[0]' always exists when it merges individual entries. And,
to trigger that function, the cache tree needs to identically match the
tree(s) being merged; if a subtree doesn't exist in either the index or any
of the merging trees, there's nothing to merge.
In any case, thanks for the reminder - I didn't notice the comment while
implementing this, so it could have just as easily needed fixing.
next prev parent reply other threads:[~2022-08-05 16:36 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-04 20:46 [PATCH 0/4] reset/checkout: fix miscellaneous sparse index bugs Victoria Dye via GitGitGadget
2022-08-04 20:46 ` [PATCH 1/4] checkout: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-08-05 17:59 ` Derrick Stolee
2022-08-04 20:46 ` [PATCH 2/4] oneway_diff: handle removed sparse directories Victoria Dye via GitGitGadget
2022-08-04 20:46 ` [PATCH 3/4] cache.h: replace 'index_entry_exists()' with 'index_name_pos_sparse()' Victoria Dye via GitGitGadget
2022-08-04 22:16 ` Junio C Hamano
2022-08-06 0:09 ` Junio C Hamano
2022-08-04 20:46 ` [PATCH 4/4] unpack-trees: handle missing sparse directories Victoria Dye via GitGitGadget
2022-08-04 23:23 ` Junio C Hamano
2022-08-05 16:36 ` Victoria Dye [this message]
2022-08-05 19:24 ` Junio C Hamano
2022-08-07 2:57 ` [PATCH v2 0/4] reset/checkout: fix miscellaneous sparse index bugs Victoria Dye via GitGitGadget
2022-08-07 2:57 ` [PATCH v2 1/4] checkout: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-08-07 2:57 ` [PATCH v2 2/4] oneway_diff: handle removed sparse directories Victoria Dye via GitGitGadget
2022-08-07 2:57 ` [PATCH v2 3/4] cache.h: create 'index_name_pos_sparse()' Victoria Dye via GitGitGadget
2022-08-07 2:57 ` [PATCH v2 4/4] unpack-trees: unpack new trees as sparse directories Victoria Dye via GitGitGadget
2022-08-08 19:07 ` [PATCH v3 0/4] reset/checkout: fix miscellaneous sparse index bugs Victoria Dye via GitGitGadget
2022-08-08 19:07 ` [PATCH v3 1/4] checkout: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-08-08 19:07 ` [PATCH v3 2/4] oneway_diff: handle removed sparse directories Victoria Dye via GitGitGadget
2022-08-08 19:07 ` [PATCH v3 3/4] cache.h: create 'index_name_pos_sparse()' Victoria Dye via GitGitGadget
2022-08-08 19:07 ` [PATCH v3 4/4] unpack-trees: unpack new trees as sparse directories Victoria Dye via GitGitGadget
2022-08-08 21:17 ` [PATCH v3 0/4] reset/checkout: fix miscellaneous sparse index bugs Junio C Hamano
2022-08-09 13:20 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3825ef9a-4c71-21ed-6452-bbd322ca839c@github.com \
--to=vdye@github.com \
--cc=derrickstolee@github.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=gitster@pobox.com \
--cc=newren@gmail.com \
--cc=shaoxuan.yuan02@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).