git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: William Sprent <williams@unity3d.com>
To: Elijah Newren <newren@gmail.com>
Cc: "William Sprent via GitGitGadget" <gitgitgadget@gmail.com>,
	git@vger.kernel.org, "Eric Sunshine" <sunshine@sunshineco.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Derrick Stolee" <derrickstolee@github.com>,
	"Victoria Dye" <vdye@github.com>
Subject: Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
Date: Mon, 30 Jan 2023 16:28:09 +0100	[thread overview]
Message-ID: <2fd448af-cdd8-92ca-f714-472aac9c6db1@unity3d.com> (raw)
In-Reply-To: <CABPp-BGs2wG6a3oR8dmT9dkeakoiZ+w-Tf=4A-GXeDVkJ9QNMA@mail.gmail.com>

On 28/01/2023 17.45, Elijah Newren wrote:
> On Fri, Jan 27, 2023 at 3:59 AM William Sprent <williams@unity3d.com> wrote:
>>
>> On 26/01/2023 04.25, Elijah Newren wrote:
>>> On Wed, Jan 25, 2023 at 8:16 AM William Sprent <williams@unity3d.com> wrote:
>>>>
>>>> On 25/01/2023 06.11, Elijah Newren wrote:
>>>>> It looks like Ævar and Victoria have both given really good reviews
>>>>> already, but I think I spotted some additional things to comment on.
>>>>>
>>>>> On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
>>>>> <gitgitgadget@gmail.com> wrote:
>>>>>>
>>>>>> From: William Sprent <williams@unity3d.com>
>>>>>>
>>>>>> There is currently no way to ask git the question "which files would be
>>>>>> part of a sparse checkout of commit X with sparse checkout patterns Y".
>>>>>> One use-case would be that tooling may want know whether sparse checkouts
>>>>>> of two commits contain the same content even if the full trees differ.
>>>>>
>>>>> Could you say more about this usecase?  Why does tooling need or want
>>>>> to know this; won't a checkout of the new commit end up being quick
>>>>> and simple?  (I'm not saying your usecase is bad, just curious that it
>>>>> had never occurred to me, and I'm afraid I'm still not sure what your
>>>>> purpose might be.)
>>>>>
>>>>
>>>> I'm thinking mainly about a monorepo context where there are a number of
>>>> distinct 'units' that can be described with sparse checkout patterns.
>>>> And perhaps there's some tooling that only wants to perform an action if
>>>> the content of a 'unit' changes.
>>>
>>> So, you're basically wanting to do something like
>>>      git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT1
>>>> sparse-files
>>>      git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT2
>>>>> sparse-files
>>>      sort sparse-files | uniq >relevant-files
>>>      git diff --name-only $COMMIT1 $COMMIT2 >changed-files
>>> and then checking whether relevant-files and changed-files have a
>>> non-empty intersection?
>>
>> Well, the concrete use-case I'm exploring is something along the lines
>> of using the content hashes of sparse checkouts as cache keys for resource
>> heavy jobs (builds/tests/etc).
>>
>> So, that would be something along the lines of,
>>
>>       git ls-tree -r --paths-matching-sparsity-file=<pattern-file> \
>>       | sha1sum > cache-key
>>
>> and then performing a lookup before performing an action (which would
>> then only be done in the context of the sparse checkout). My thinking
>> is that this only would require git and no additional tooling, which in
>> turn makes it very easy to reproduce the state where the job took place.
>>
>>
>>> Would that potentially be better handled by
>>>      git diff --name-only $COMMIT1 $COMMIT2 | git check-ignore
>>> --ignore-file=<pattern-file> --stdin
>>> and seeing whether the output is non-empty?  We'd have to add an
>>> "--ignore-file" option to check-ignore to override reading of
>>> .gitignore files and such, and it'd be slightly confusing because the
>>> documentation talks about "ignored" files rather than "selected"
>>> files, but that's a confusion point that has been with us ever since
>>> the gitignore mechanism was repurposed for sparse checkouts.  Or maybe
>>> we could just add a check-sparsity helper, and then allow it to take
>>> directories in-lieu of patterns.
>>
>> I don't think it necessarily would be better handled by that. But it would
>> be workable. It would be a matter of collating the output of
>>
>>     git ls-tree -r <commit>
>>
>> with
>>
>>     git ls-tree --name-only -r <commit> | git check-ignore ...
>>
>> Which is less ergonomic. But it is also a less intrusive change.
>>
>> Really, the main thing is to expose the sparse filtering logic somehow, and
>> allow for building tooling on top of it.
>>> This seems nicer than opening a can of worms about letting every git
>>> command specify a different set of sparsity rules.
>>
>> I think you are the better judge of how much of a can of worms that would
>> be. I don't think it would be too out of line with how git acts in general
>> though, as we have things like the the 'GIT_INDEX_FILE' env-var.
> 
> I agree you've got a reasonable usecase here.  The integration you've
> described sounds like something that could be independently composable
> with several other commands, and you alluded to that in your commit
> message.  But is adding it to ls-tree the best spot?  If we add it
> there, then folks who want to similarly integrate such capabilities
> with other commands (archive, diff, grep, rev-list --objects, etc.),
> seem more likely to do so via direct integration.  We already have a
> very large can of worms to work on to make commands behave in ways
> that are limited to sparse paths (see
> Documentation/techncial/sparse-checkout.txt, namely the "behavior A"
> stuff).  As can be seen in that document, what to do for limiting
> commands to sparse paths is obvious with some commands but has lots of
> interesting edge cases for others (even with years of experience with
> sparse checkouts, we had 3- and 4- way differing opinions on the
> intended behavior for some commands when we started composing that
> document a few months ago).  If we had jumped straight to
> implementation for some commands, we would have likely painted
> ourselves into a corner for other commands.  Adding another layer of
> specifying an alternate set of sparsity rules will likely have
> interesting knock-on effects that we should think through for all the
> commands to ensure we aren't painting ourselves into a similar corner,
> if we go down this route.
> 
> However, in the cases that sparse-checkout.txt document was
> addressing, the behavior fundamentally needs to be integrated into all
> the relevant commands to get user experience right.  In your case, you
> merely need a separate tool to be able to compose the output of
> different commands together.  So, exposing whether sparsity rules
> would select various paths in a single separate command (maybe git
> check-ignore, or a new sparse-checkout subcommand, or maybe just a new
> subcommand similar to check-ignore) would avoid a lot of these issues,
> and give us a single place to extend/improve as we learn about further
> usecases.

I do think that 'ls-tree' is ultimately (eventually?) the right place for
something like this. But you do make some good points about it perhaps being a
bit early.

> 
> [...]
>>>> I agree that it is a bit awkward to have to "translate" the directories
>>>> into patterns when wanting to use cone mode. I can try adding
>>>> '--[no]-cone' flags and see how that feels. Together with Victoria's
>>>> suggestions that would result in having the following flags:
>>>>
>>>> * --scope=(sparse|all)
>>>> * --sparse-patterns-file=<path>
>>>> * --[no]-cone: used together with --sparse-patterns-file to tell git
>>>>      whether to interpret the patterns given as directories (cone) or
>>>>      patterns (no-cone).
>>>>
>>>> Which seems like a lot at first glance. But it allows for passing
>>>> directories instead of patterns for cone mode, and is similar to the
>>>> behaviour of 'sparse-checkout set'.
>>>>
>>>> Does that seem like something that would make sense?
>>>
>>> --sparse-patterns-file still implies patterns; I think that would need
>>> some rewording.
>>
>> Yeah. After sleeping on it, I also think that it becomes a difficult
>> interface to work with, and you'll get different results with the same
>> patterns whether you pass --cone or --no-cone, which seems error prone
>> to me.
>>
>> For better or for worse, both cone and non-cone uses of sparse-checkouts
>> end up producing pattern files. And those pattern files do unambiguously
>> describe a filtering of the worktree whether it is in cone-mode or not.
> 
> Back when cone mode was introduced, I objected to reusing the existing
> pattern format and/or files...but was assuaged that it was just an
> implementation detail that wouldn't be exposed to users (since people
> would interact with the 'sparse-checkout' command instead of direct
> editing of $GIT_DIR/info/sparse-checkout).  It's still a performance
> penalty, because we waste time both turning directory names into
> patterns when we write them out, and when we read them in by parsing
> the patterns to see if they match cone mode rules and if so discard
> the patterns and just build up our hashes.  The patterns are nothing
> more than an intermediary format we waste time translating to and
> from, though once upon a time they existed so that if someone suddenly
> started using an older (now ancient) version of git on their current
> checkout then it could still hobble along with degraded performance
> instead of providing an error.  (We have introduced other changes to
> the config and git index which would cause older git clients to just
> fail to parse and throw an error, and even have defined mechanisms for
> such erroring out.  We could have taken advantage of that for this
> feature too.)
> 
> Anyway, long story short, if you're going to start exposing users to
> this implementation detail that was meant to be hidden (and do it in a
> way that may be copied into several commands to boot), then I
> definitely object.

Alright. Fair. I think with that additional context, I agree. I was coming from
a "this is a common format to both cone and non-cone modes" not a "this is a
leftover implementation detail from now-deprecated use cases" which is the vibe
I'm getting here.

I still think it would be unfortunate to have a single input parameter that
takes both kinds of specifications and interprets them either as cone or non-
cone depending on config and/or a flag. With cone mode specifications like
'a/b/c' being valid gitignore patterns there's no way of knowing whether the
user actually is supplying a pattern or accidentally supplied a directory in
non-cone mode.

With that in mind, I'd probably suggest something along the lines of having
'ls-tree' only accept cone-mode specifications (e.g. --cone-rules-file or so) to
avoid having the non-cone/cone distinction spread outside the 'sparse-checkout'
command. But that would be leaving non-cone users behind, and I guess we are not
quite there yet.
  
>> Given that 'ls-tree' is more of a plumbing command, I think it might still
>> make sense to use the patterns. That would also make the interaction
>> a bit more logical to me -- e.g. if you want to override the patterns
>> you have to pass them in the same format as the ones that would be read
>> by default.
> 
> No, sparsity specification should be provided by users the same way
> they normally specify it (i.e. the way they pass it to
> `sparse-checkout {add,set}`), not the way it's stored via some hidden
> implementation detail.
> 
> I'd make an exception if you ended up using `git check-ignore` because
> that command was specifically written about gitignore-style rules, and
> git-ignore-style rules just happen to have been reused as-is for
> non-cone-mode sparse-checkout rules.  But if you go that route:
>     (1) you have to frame any extensions to check-ignore as things that
> are useful for gitignore checking, and only incidentally useful to
> sparse-checkouts
>     (2) you'd have to forgo any cone-mode optimizations
> Going the check-ignore route seems like the easiest path to providing
> the basic functionality you need right now.  If your usecases remain
> niche, that might be good enough for all time too.  If your usecases
> become popular, though, I expect someone would eventually tire of
> using `check-ignore` and implement similar functionality along with
> supporting cone-mode in a `git check-sparsity` or `git sparse-checkout
> debug-rules` command or something like that.

I guess that would be workable. But I would only want to use the command with
cone-mode patterns, so losing the cone-mode optimisations would be suboptimal.

If the 'ls-tree' path isn't the best way forward right now, do you think it
would be viable to add a subcommand along the lines of 'debug-rules/check-rules'
to sparse-checkout?

Then

     git ls-tree -r HEAD | git sparse-checkout debug-rules

would output only the path that match the current patterns. And

     <...> | git sparse-checkout debug-rules --rules-file <file> --[no]-cone

would be equivalent to setting the patterns with

    cat <file> | git sparse-checkout set --stdin --[no]-cone

before running the command above.

Then all the cone/no-cone pattern specification stuff is still contained within
the 'sparse-checkout' command and the documentation for the sub-command would
reside in the sparse-checkout.tx file next to all the caveats and discussion about
sparse-checkout.



      reply	other threads:[~2023-01-30 15:28 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-11 17:01 [PATCH] ls-tree: add --sparse-filter-oid argument William Sprent via GitGitGadget
2023-01-13 14:17 ` Eric Sunshine
2023-01-13 20:01   ` Junio C Hamano
2023-01-16 15:13     ` William Sprent
2023-01-16 12:14   ` William Sprent
2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
2023-01-24 15:30     ` William Sprent
2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
2023-01-24 15:30     ` William Sprent
2023-01-24 20:11   ` Victoria Dye
2023-01-25 13:47     ` William Sprent
2023-01-25 18:32       ` Victoria Dye
2023-01-26 14:55         ` William Sprent
2023-01-25  5:11   ` Elijah Newren
2023-01-25 16:16     ` William Sprent
2023-01-26  3:25       ` Elijah Newren
2023-01-27 11:58         ` William Sprent
2023-01-28 16:45           ` Elijah Newren
2023-01-30 15:28             ` William Sprent [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2fd448af-cdd8-92ca-f714-472aac9c6db1@unity3d.com \
    --to=williams@unity3d.com \
    --cc=avarab@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=newren@gmail.com \
    --cc=sunshine@sunshineco.com \
    --cc=vdye@github.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).