Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: William Sprent <williams@unity3d.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"William Sprent via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Eric Sunshine <sunshine@sunshineco.com>,
	Derrick Stolee <derrickstolee@github.com>,
	Victoria Dye <vdye@github.com>, Elijah Newren <newren@gmail.com>
Subject: Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
Date: Tue, 24 Jan 2023 16:30:29 +0100	[thread overview]
Message-ID: <9787a68d-80a3-f554-355c-5015798a8816@unity3d.com> (raw)
In-Reply-To: <230123.861qnltmbp.gmgdl@evledraar.gmail.com>

On 23/01/2023 14.06, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Jan 23 2023, William Sprent via GitGitGadget wrote:
> 
>> From: William Sprent <williams@unity3d.com>
> 
> ...some further inline commentary, in addition to my proposed squash-in...
> 

Thanks for taking the time. :)

>> diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
>> index 0240adb8eec..724bb1f9dbd 100644
>> --- a/Documentation/git-ls-tree.txt
>> +++ b/Documentation/git-ls-tree.txt
>> @@ -11,6 +11,7 @@ SYNOPSIS
>>   [verse]
>>   'git ls-tree' [-d] [-r] [-t] [-l] [-z]
>>   	    [--name-only] [--name-status] [--object-only] [--full-name] [--full-tree] [--abbrev[=<n>]] [--format=<format>]
>> +	    [--sparse-filter-oid=<blob-ish>]
>>   	    <tree-ish> [<path>...]
>>   
>>   DESCRIPTION
>> @@ -48,6 +49,11 @@ OPTIONS
>>   	Show tree entries even when going to recurse them. Has no effect
>>   	if `-r` was not passed. `-d` implies `-t`.
>>   
>> +--sparse-filter-oid=<blob-ish>::
>> +	Omit showing tree objects and paths that do not match the
>> +	sparse-checkout specification contained in the blob
>> +	(or blob-expression) <blob-ish>.
>> +
>>   -l::
>>   --long::
>>   	Show object size of blob (file) entries.
>> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
>> index 72eb70823d5..46a815fc7dc 100644
>> --- a/builtin/ls-tree.c
>> +++ b/builtin/ls-tree.c
>> @@ -13,6 +13,7 @@
>>   #include "builtin.h"
>>   #include "parse-options.h"
>>   #include "pathspec.h"
>> +#include "dir.h"
>>   
>>   static const char * const ls_tree_usage[] = {
>>   	N_("git ls-tree [<options>] <tree-ish> [<path>...]"),
>> @@ -364,6 +365,65 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
>>   	},
>>   };
> 
> Okey, here yo uupdate the *.txt, but not the "-h", but it's one of those
> where way say <options>.
> 
> I for one wouldn't mind some preceding change like my 423be1f83c5 (doc
> txt & -h consistency: make "commit" consistent, 2022-10-13), which in
> turn would make t/t0450-txt-doc-vs-help.sh pass for this command, but
> maybe it's too much of a digression...
> 

I don't mind doing that.

>> +	(*d) = xcalloc(1, sizeof(**d));
>> +	(*d)->fn = fn;
>> +	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
> 
> I forgot to note in my proposed fix-up that I omitted this, but your
> tests still pass, which suggests it's either not needed, or your tests
> are lacking....
> 

Well.. The line exists to enable the cone mode optimisations. The only 
observable differences are performance and that a warning is emitted 
when cone mode is enabled in the config but the patterns given aren't in 
the cone mode pattern set.

I can look into writing a performance test that captures the performance 
differences.

I can also test for the latter behaviour. But that is testing for the 
side effect of a main behaviour to ensure that the main behaviour is 
there. Which isn't the nicest thing I can think of, but it might be 
better than nothing.

>> +	(*d)->options = options;
>> +	strbuf_init(&(*d)->full_path_buf, 0);
>> +
>> +	if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
>> +			&sparse_oid, &oc))
>> +		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
>> +	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)
> 
> As noted you can avoid the malloc here, which also sidesteps this (IMO
> at last) ugly &(*v)->m syntax.
> 

Right.

>> +		die(_("unable to parse sparse filter data in %s"),
>> +		    oid_to_hex(&sparse_oid));
>> +}
>> +
>> +static void free_sparse_filter_data(struct sparse_filter_data *d)
>> +{
>> +	clear_pattern_list(&d->pl);
>> +	strbuf_release(&d->full_path_buf);
>> +	free(d);
>> +}
>> +
>> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
>> +{
>> +	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
> 
> I for one would welcome e.g. abbreviating "sparse_checkout_patterns" as
> "scp" or whatever throughout, with resulting shortening of very long
> lines & identifiers. E.g. for this one "patch_match_scp" or whatever.
> 
> 

I don't mind either way, but a quick search doesn't reveal any of uses 
of an abbreviation like, though.

Either way, I could also drop the '_checkout_' from 
'sparse_checkout_patterns'. I don't think that would make it less clear 
what we are talking about.

>> +	struct sparse_filter_data *data = context;
>> +
> 
> Style: Don't add a \n\n between variable decls,
> 
>> +	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);
> 
> Instead add it here, before the code proper.
> 
> 

Alright. I'll apply that.

>> +	if (object_type(mode) == OBJ_TREE)
>> +		return do_recurse;
>> +
>> +	strbuf_reset(&data->full_path_buf);
> 
> I wondered if we need this after, but we don't, I for one would welcome a fix-up like:
> 	
> 	diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
> 	index 68d6ef675f2..74dfa9a4580 100644
> 	--- a/builtin/ls-tree.c
> 	+++ b/builtin/ls-tree.c
> 	@@ -410,19 +410,21 @@ static int filter_sparse(const struct object_id *oid, struct strbuf *base,
> 	 			 const char *pathname, unsigned mode, void *context)
> 	 {
> 	 	struct sparse_filter_data *data = context;
> 	-
> 	 	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);
> 	+	int ret = 0;
> 	+
> 	 	if (object_type(mode) == OBJ_TREE)
> 	 		return do_recurse;
> 	
> 	-	strbuf_reset(&data->full_path_buf);
> 	 	strbuf_addbuf(&data->full_path_buf, base);
> 	 	strbuf_addstr(&data->full_path_buf, pathname);
> 	
> 	 	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
> 	-			return 0;
> 	-
> 	-	return data->fn(oid, base, pathname, mode, data->options);
> 	+		goto cleanup;
> 	+	ret = data->fn(oid, base, pathname, mode, data->options);
> 	+cleanup:
> 	+	strbuf_reset(&data->full_path_buf);
> 	+	return ret;
> 	 }
> 	
> 	 int cmd_ls_tree(int argc, const char **argv, const char *prefix)
> 	
> It's slightly more verbose, and we do end up needlessly reset-ing the
> "last" one, but makes it clear that we don't have need for this after
> this.
> 
> Which to me, also raises the question of why you have this
> "full_path_buf" at all. It looks like a memory optimization, as you're
> trying to avoid a malloc()/free() in the loop.
> 
> That's fair, but you could also do that with:
> 
> 	const size_t oldlen = base->len;
> 	strbuf_addstr(base, pathname);
>          /* do the path_matches_sparse_checkout_patterns() call here */
> 	/* before "cleanup" */
> 	strbuf_setlen(base, oldlen);
> 
> Or whatever, those snippets are untested, but we use similar tricks
> elsewhere, and as it's contained to a few lines here I think it's better
> than carrying another buffer.
> 

Okay. That makes sense. And if it is a common trick then I guess it 
isn't too mysterious.

>> +	strbuf_addbuf(&data->full_path_buf, base);
>> +	strbuf_addstr(&data->full_path_buf, pathname);
> 
> Nit: Shorter as: strbuf_addf(&sb, "%s%s", base.buf, pathname) (or equivalent);
> 
>> +
>> +	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
>> +			return 0;
>> +
>> +	return data->fn(oid, base, pathname, mode, data->options);
>> +}
>> +
>>   int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   {
>>   	struct object_id oid;
>> @@ -372,7 +432,11 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   	read_tree_fn_t fn = NULL;
>>   	enum ls_tree_cmdmode cmdmode = MODE_DEFAULT;
>>   	int null_termination = 0;
>> +
> 
> Don't add this stray \n.
> 

Ok.

>>   	struct ls_tree_options options = { 0 };
>> +	char *sparse_oid_name = NULL;
>> +	void *read_tree_context = NULL;
>> +	struct sparse_filter_data *sparse_filter_data = NULL;
>>   	const struct option ls_tree_options[] = {
>>   		OPT_BIT('d', NULL, &options.ls_options, N_("only show trees"),
>>   			LS_TREE_ONLY),
>> @@ -399,6 +463,9 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   					 N_("format to use for the output"),
>>   					 PARSE_OPT_NONEG),
>>   		OPT__ABBREV(&options.abbrev),
>> +		OPT_STRING_F(0, "filter-sparse-oid", &sparse_oid_name, N_("filter-sparse-oid"),
> 
> Make that N_(...) be "object-id", "oid", "hash" or something?
> 
> I.e. the RHS with the <type> should be <thingy>, not
> <thingy-for-some-purpose>.
> 

Right. I've used the "blob-ish" wording in the .txt file, I think it 
makes sense to reuse it here.

>> +			   N_("filter output with sparse-checkout specification contained in the blob"),
>> +			     PARSE_OPT_NONEG),
>>   		OPT_END()
>>   	};
>>   	struct ls_tree_cmdmode_to_fmt *m2f = ls_tree_cmdmode_format;
>> @@ -474,7 +541,17 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   		break;
>>   	}
>>   
>> -	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, &options);
>> +	if (sparse_oid_name) {
>> +		init_sparse_filter_data(&sparse_filter_data, &options, sparse_oid_name, fn);
>> +		read_tree_context = sparse_filter_data;
>> +		fn = filter_sparse;
>> +	} else
>> +		read_tree_context = &options;
> 
> You're missing a {} here for the "else".
> 
> Better yet, delete that "else" and change the decl above to be:
> 
> 	void *context = &options;
> 
> Then just keep the "if" here, where you might clobber the "context".
> 

I'll delete the else as you suggest.

>> +
>> +	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, read_tree_context);
>>   	clear_pathspec(&options.pathspec);
>> +	if (sparse_filter_data)
>> +		free_sparse_filter_data(sparse_filter_data);
> 
> I suggested a fixup for this, but as an aside don't make a free_*()
> function that's not happy to accept NULL, it should work like the free()
> itself.
> 

Ah. That's a fair point. Thanks.

>> +
>>   	return ret;
>>   }
>> diff --git a/dir.c b/dir.c
>> index 4e99f0c868f..122ebced08e 100644
>> --- a/dir.c
>> +++ b/dir.c
>> @@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>>   	return 0;
>>   }
>>   
>> -static int path_in_sparse_checkout_1(const char *path,
>> -				     struct index_state *istate,
>> -				     int require_cone_mode)
>> +int recursively_match_path_with_sparse_patterns(const char *path,
>> +						struct index_state *istate,
>> +						int dtype,
>> +						struct pattern_list *pl)
>>   {
>> -	int dtype = DT_REG;
>>   	enum pattern_match_result match = UNDECIDED;
>>   	const char *end, *slash;
>> -
>> -	/*
>> -	 * We default to accepting a path if the path is empty, there are no
>> -	 * patterns, or the patterns are of the wrong type.
>> -	 */
>> -	if (!*path ||
>> -	    init_sparse_checkout_patterns(istate) ||
>> -	    (require_cone_mode &&
>> -	     !istate->sparse_checkout_patterns->use_cone_patterns))
>> -		return 1;
>> -
>>   	/*
>>   	 * If UNDECIDED, use the match from the parent dir (recursively), or
>>   	 * fall back to NOT_MATCHED at the topmost level. Note that cone mode
>>   	 * never returns UNDECIDED, so we will execute only one iteration in
>>   	 * this case.
>>   	 */
>> -	for (end = path + strlen(path);
>> -	     end > path && match == UNDECIDED;
>> +	for (end = path + strlen(path); end > path && match == UNDECIDED;
> 
> All good, aside from this whitespace refactoring, as noted.
> 

Ok.

>>   	     end = slash) {
>> -
>>   		for (slash = end - 1; slash > path && *slash != '/'; slash--)
>>   			; /* do nothing */
>>   
>>   		match = path_matches_pattern_list(path, end - path,
>>   				slash > path ? slash + 1 : path, &dtype,
>> -				istate->sparse_checkout_patterns, istate);
>> +				pl, istate);
>>   
>>   		/* We are going to match the parent dir now */
>>   		dtype = DT_DIR;
>>   	}
>> -	return match > 0;
>> +
>> +	return match;
>> +}
>> +
>> +static int path_in_sparse_checkout_1(const char *path,
>> +				     struct index_state *istate,
>> +				     int require_cone_mode)
>> +{
>> +	/*
>> +	 * We default to accepting a path if the path is empty, there are no
>> +	 * patterns, or the patterns are of the wrong type.
>> +	 */
>> +	if (!*path ||
>> +	    init_sparse_checkout_patterns(istate) ||
>> +	    (require_cone_mode &&
>> +	     !istate->sparse_checkout_patterns->use_cone_patterns))
>> +		return 1;
>> +
>> +	return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;
> 
> All good, except for the really long line, aside from the previous
> suggestion, maybe pull "istate->sparse_checkout_patterns" into a
> variable, as it's now used twice here in this function?
> 

I think that makes sense.

>> +check_agrees_with_ls_files () {
>> +	REPO=repo  &&
>> +	git -C $REPO submodule deinit -f --all &&
>> +	git -C $REPO cat-file -p $filter_oid >"$REPO/.git/info/sparse-checkout" &&
>> +	git -C $REPO sparse-checkout init --cone &&
>> +	git -C $REPO submodule init &&
>> +	git -C $REPO ls-files -t >out &&
>> +	sed -n "/^S /!s/^. //p" out >expect &&
>> +	test_cmp expect actual
> 
> Instead of this last "sed" munging, why not use the "--format" option to
> "ls-tree" to just make the output the same?

I hadn't thought about changing that side of the output. I would have to 
call ls-tree once more instead of relying on the 'expect' file to be 
there when the function is called. But that might be nicer anyway.

next prev parent reply	other threads:[~2023-01-24 15:30 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-11 17:01 [PATCH] ls-tree: add --sparse-filter-oid argument William Sprent via GitGitGadget
2023-01-13 14:17 ` Eric Sunshine
2023-01-13 20:01   ` Junio C Hamano
2023-01-16 15:13     ` William Sprent
2023-01-16 12:14   ` William Sprent
2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
2023-01-24 15:30     ` William Sprent
2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
2023-01-24 15:30     ` William Sprent [this message]
2023-01-24 20:11   ` Victoria Dye
2023-01-25 13:47     ` William Sprent
2023-01-25 18:32       ` Victoria Dye
2023-01-26 14:55         ` William Sprent
2023-01-25  5:11   ` Elijah Newren
2023-01-25 16:16     ` William Sprent
2023-01-26  3:25       ` Elijah Newren
2023-01-27 11:58         ` William Sprent
2023-01-28 16:45           ` Elijah Newren
2023-01-30 15:28             ` William Sprent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9787a68d-80a3-f554-355c-5015798a8816@unity3d.com \
    --to=williams@unity3d.com \
    --cc=avarab@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=newren@gmail.com \
    --cc=sunshine@sunshineco.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).