git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, derrickstolee@github.com,
	shaoxuan.yuan02@gmail.com, newren@gmail.com,
	Victoria Dye <vdye@github.com>
Subject: Re: [PATCH 4/4] unpack-trees: handle missing sparse directories
Date: Thu, 04 Aug 2022 16:23:12 -0700	[thread overview]
Message-ID: <xmqqa68j1tlr.fsf@gitster.g> (raw)
In-Reply-To: <016971a67112efe2d15fe7908e86c5d2631f8e66.1659645967.git.gitgitgadget@gmail.com> (Victoria Dye via GitGitGadget's message of "Thu, 04 Aug 2022 20:46:07 +0000")

"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Victoria Dye <vdye@github.com>
>
> If a sparse directory does not exist in the index, unpack it at the
> directory level rather than recursing into it an unpacking its contents
> file-by-file. This helps keep the sparse index as collapsed as possible in
> cases such as 'git reset --hard' restoring a sparse directory.

My reading hiccuped at "does not exist in".  After reading the above
twice, I think the situation is that there is a directory A/B in
which a file C sits, and A/ is marked as sparse and the paragraph is
talking about directory A/B.  Because the index has A/ as a tree in
index, A/B does not exist in the index.

When we _somehow_ need to ensure that A/B exists in the index for
_unknown_ reason, we could flatten the index fully and have A/B/C as
a blob (without A or A/B in the index proper, even though they may
appear in cache-tree), but if we stop at removing A and adding A/B
still as a tree without showing A/B/C in the index, we may gain
efficiency, under the assumption that we do not have to access A/B/C
and its siblings.

Did I read the intention correctly?  I suspect future readers of the
commit that would result from this patch would also wonder what the
"somehow" and "unknown" above are.

> A directory is determined to be truly non-existent in the index (rather than
> the parent of existing index entries), if 1) its path is outside the sparse
> cone and 2) there are no children of the directory in the index. This check
> is performed by 'missing_dir_is_sparse()' in 'unpack_single_entry()'. If the
> directory is a missing sparse dir, 'unpack_single_entry()'  will proceed
> with unpacking it. This determination is also propagated back up to
> 'unpack_callback()' via 'is_missing_sparse_dir' to prevent further tree
> traversal into the unpacked directory.

Makes sense.

> diff --git a/unpack-trees.c b/unpack-trees.c
> index 8a454e03bff..aa62cef20fe 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1069,6 +1069,53 @@ static struct cache_entry *create_ce_entry(const struct traverse_info *info,
>  	return ce;
>  }
>  
> +/*
> + * Determine whether the path specified corresponds to a sparse directory
> + * completely missing from the index. This function is assumed to only be
> + * called when the named path isn't already in the index.
> + */
> +static int missing_dir_is_sparse(const struct traverse_info *info,
> +				 const struct name_entry *p)
> +{
> +	int res, pos;
> +	struct strbuf dirpath = STRBUF_INIT;
> +	struct unpack_trees_options *o = info->data;
> +
> +	/*
> +	 * First, check whether the path is in the sparse cone. If it is,
> +	 * then this directory shouldn't be sparse.
> +	 */
> +	strbuf_add(&dirpath, info->traverse_path, info->pathlen);
> +	strbuf_add(&dirpath, p->path, p->pathlen);
> +	strbuf_addch(&dirpath, '/');
> +	if (path_in_cone_mode_sparse_checkout(dirpath.buf, o->src_index)) {
> +		res = 0;
> +		goto cleanup;
> +	}

OK.

> +	/*
> +	 * Given that the directory is not inside the sparse cone, it could be
> +	 * (partially) expanded in the index. If child entries exist, the path
> +	 * is not a missing sparse directory.
> +	 */
> +	pos = index_name_pos_sparse(o->src_index, dirpath.buf, dirpath.len);
> +	if (pos >= 0)
> +		BUG("cache entry '%s%s' shouldn't exist in the index",
> +		    info->traverse_path, p->path);

So, we fed 'p' to this function, and we didn't expect it to exist in
the index if it is outside the sparse cone.

> +	pos = -pos - 1;

This is the location that a cache entry for the dirpath.buf would
sit in the index if it were a sparse entry.

> +	if (pos >= o->src_index->cache_nr) {
> +		res = 1;
> +		goto cleanup;
> +	}
> +	res = strncmp(o->src_index->cache[pos]->name, dirpath.buf, dirpath.len);

So, we found where dirpath.buf would be inserted.  If we (can) look
at the cache entry that is currently sitting at that position, and
find that it is a path inside it, then res becomes zero.  IOW, we
found that the sparse directory is missing in the index, but there
is a path that is in the directory recorded in the index.

The current entry, on the other hand, is outside the dirpath.buf,
then res becomes non-zero.  We are asked to yield true (i.e.
nonzero) if and only if the given directory and all paths in that
directory are missing from the index, so that is what happens here.
Sounds OK.

And the "out of bounds" check just means that the entries that sit
near the end of the index sort strictly before our dirpath.buf, so
they cannot be inside our directory, which is also the case where we
are expected to yield "true".

OK.

> +
> +cleanup:
> +	strbuf_release(&dirpath);
> +	return res;
> +}
> +
>  /*
>   * Note that traverse_by_cache_tree() duplicates some logic in this function
>   * without actually calling it. If you change the logic here you may need to
> @@ -1078,21 +1125,40 @@ static int unpack_single_entry(int n, unsigned long mask,
>  			       unsigned long dirmask,
>  			       struct cache_entry **src,
>  			       const struct name_entry *names,
> -			       const struct traverse_info *info)
> +			       const struct traverse_info *info,
> +			       int *is_missing_sparse_dir)
>  {
>  	int i;
>  	struct unpack_trees_options *o = info->data;
>  	unsigned long conflicts = info->df_conflicts | dirmask;
> +	const struct name_entry *p = names;
>  
> -	if (mask == dirmask && !src[0])
> -		return 0;
> +	*is_missing_sparse_dir = 0;
> +	if (mask == dirmask && !src[0]) {
> +		/*
> +		 * If the directory is completely missing from the index but
> +		 * would otherwise be a sparse directory, we should unpack it.
> +		 * If not, we'll return and continue recursively traversing the
> +		 * tree.
> +		 */
> +		if (!o->src_index->sparse_index)
> +			return 0;
> +
> +		/* Find first entry with a real name (we could use "mask" too) */
> +		while (!p->mode)
> +			p++;
> +
> +		*is_missing_sparse_dir = missing_dir_is_sparse(info, p);
> +		if (!*is_missing_sparse_dir)
> +			return 0;

Interesting.  Nobody checked if the name in 'p' is a directory up to
this point, but missing_dir_is_sparse() does not care?  The only
thing we know is that !src[0], so the name represented by 'p' does
not exist in the index.  IIRC, the function only checked if p names
a path inside or outside the sparse cone and if there are or are not
paths that would be inside the path if p _were_ a directory but I
didn't read it carefully what it did when the caller fed a path to a
non-directory to the function.  As long as it will say "no" to such
a situation, I won't complain, but I have to wonder how the logic
around here tells apart a case where a sparse directory is
completely hidden (which missing_dir_is_sparse() gives us) from a
case where there is absolutely nothing, not a directory and not a
blob, at the path in the index.  Or perhaps it works correctly
without having to tell them apart?  I dunno.

By the way, there is a comment before the function that reminds us
that traverse_by_cache_tree() may need a matching change whenever a
change is made to this function.  Does this require such a matching
change?

  reply	other threads:[~2022-08-04 23:23 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-04 20:46 [PATCH 0/4] reset/checkout: fix miscellaneous sparse index bugs Victoria Dye via GitGitGadget
2022-08-04 20:46 ` [PATCH 1/4] checkout: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-08-05 17:59   ` Derrick Stolee
2022-08-04 20:46 ` [PATCH 2/4] oneway_diff: handle removed sparse directories Victoria Dye via GitGitGadget
2022-08-04 20:46 ` [PATCH 3/4] cache.h: replace 'index_entry_exists()' with 'index_name_pos_sparse()' Victoria Dye via GitGitGadget
2022-08-04 22:16   ` Junio C Hamano
2022-08-06  0:09   ` Junio C Hamano
2022-08-04 20:46 ` [PATCH 4/4] unpack-trees: handle missing sparse directories Victoria Dye via GitGitGadget
2022-08-04 23:23   ` Junio C Hamano [this message]
2022-08-05 16:36     ` Victoria Dye
2022-08-05 19:24       ` Junio C Hamano
2022-08-07  2:57 ` [PATCH v2 0/4] reset/checkout: fix miscellaneous sparse index bugs Victoria Dye via GitGitGadget
2022-08-07  2:57   ` [PATCH v2 1/4] checkout: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-08-07  2:57   ` [PATCH v2 2/4] oneway_diff: handle removed sparse directories Victoria Dye via GitGitGadget
2022-08-07  2:57   ` [PATCH v2 3/4] cache.h: create 'index_name_pos_sparse()' Victoria Dye via GitGitGadget
2022-08-07  2:57   ` [PATCH v2 4/4] unpack-trees: unpack new trees as sparse directories Victoria Dye via GitGitGadget
2022-08-08 19:07   ` [PATCH v3 0/4] reset/checkout: fix miscellaneous sparse index bugs Victoria Dye via GitGitGadget
2022-08-08 19:07     ` [PATCH v3 1/4] checkout: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-08-08 19:07     ` [PATCH v3 2/4] oneway_diff: handle removed sparse directories Victoria Dye via GitGitGadget
2022-08-08 19:07     ` [PATCH v3 3/4] cache.h: create 'index_name_pos_sparse()' Victoria Dye via GitGitGadget
2022-08-08 19:07     ` [PATCH v3 4/4] unpack-trees: unpack new trees as sparse directories Victoria Dye via GitGitGadget
2022-08-08 21:17     ` [PATCH v3 0/4] reset/checkout: fix miscellaneous sparse index bugs Junio C Hamano
2022-08-09 13:20       ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqa68j1tlr.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=newren@gmail.com \
    --cc=shaoxuan.yuan02@gmail.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).