git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* git-grep in sparse checkout
@ 2019-10-01 13:06 Matheus Tavares Bernardino
  2019-10-01 13:30 ` Bert Wesarg
  2019-10-01 18:29 ` Derrick Stolee
  0 siblings, 2 replies; 9+ messages in thread
From: Matheus Tavares Bernardino @ 2019-10-01 13:06 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Derrick Stolee, Elijah Newren, Taylor Blau

Hi,

During Git Summit it was mentioned that git-grep searches outside
sparsity pattern which is not aligned with user expectation. I took a
quick look at it and it seems the reason is
builtin/grep.c:grep_cache() (which also greps worktree) will grep the
object store when a given index entry has the CE_SKIP_WORKTREE bit
turned on.

From what I understand, this bit is used exactly for sparse checkouts
(as described in Documentation/technical/index-format.txt[1]). But
should we perhaps ignore it in git-grep to have the expected behavior?
I'll be happy to send the patch if so, but I wanted to check with you
first.

Grepping with --cached or in given trees objects would still grep
outside the cone, but that is what one would expect, right? Or should
we filter out what is outside of the cone for cached grep as well?

Thanks,
Matheus

[1]: https://github.com/git/git/blob/master/Documentation/technical/index-format.txt#L101

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-01 13:06 git-grep in sparse checkout Matheus Tavares Bernardino
@ 2019-10-01 13:30 ` Bert Wesarg
  2019-10-01 16:12   ` Elijah Newren
  2019-10-01 18:29 ` Derrick Stolee
  1 sibling, 1 reply; 9+ messages in thread
From: Bert Wesarg @ 2019-10-01 13:30 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Jeff King, Derrick Stolee, Elijah Newren, Taylor Blau

Hi,

On Tue, Oct 1, 2019 at 3:06 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi,
>
> During Git Summit it was mentioned that git-grep searches outside
> sparsity pattern which is not aligned with user expectation. I took a
> quick look at it and it seems the reason is
> builtin/grep.c:grep_cache() (which also greps worktree) will grep the
> object store when a given index entry has the CE_SKIP_WORKTREE bit
> turned on.
>

I also had once this problem and found that out and wrote a patch. I
was just about to send this patch out.

Btw, ls-files should also learn to skip worktree files.

Stay tuned.

Bert

> From what I understand, this bit is used exactly for sparse checkouts
> (as described in Documentation/technical/index-format.txt[1]). But
> should we perhaps ignore it in git-grep to have the expected behavior?
> I'll be happy to send the patch if so, but I wanted to check with you
> first.
>
> Grepping with --cached or in given trees objects would still grep
> outside the cone, but that is what one would expect, right? Or should
> we filter out what is outside of the cone for cached grep as well?
>
> Thanks,
> Matheus
>
> [1]: https://github.com/git/git/blob/master/Documentation/technical/index-format.txt#L101

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-01 13:30 ` Bert Wesarg
@ 2019-10-01 16:12   ` Elijah Newren
  2019-10-02  6:33     ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Elijah Newren @ 2019-10-01 16:12 UTC (permalink / raw)
  To: Bert Wesarg
  Cc: Matheus Tavares Bernardino, git, Jeff King, Derrick Stolee,
	Taylor Blau

On Tue, Oct 1, 2019 at 6:30 AM Bert Wesarg <bert.wesarg@googlemail.com> wrote:
>
> Hi,
>
> On Tue, Oct 1, 2019 at 3:06 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > Hi,
> >
> > During Git Summit it was mentioned that git-grep searches outside
> > sparsity pattern which is not aligned with user expectation. I took a
> > quick look at it and it seems the reason is
> > builtin/grep.c:grep_cache() (which also greps worktree) will grep the
> > object store when a given index entry has the CE_SKIP_WORKTREE bit
> > turned on.
> >
>
> I also had once this problem and found that out and wrote a patch. I
> was just about to send this patch out.
>
> Btw, ls-files should also learn to skip worktree files.
>
> Stay tuned.

I too have a small patch for just grep without --cached or revisions
(it's only a few lines), but it's very incomplete as that is the only
usecase it handles.  For the usecases I'm closest too, what users have
reported they want is essentially a miniature repo where they work on
stuff they care about and ignore the rest.  As such, the desired
functionality for these users is:

* git grep, by default, should only search within the sparsity pattern
* git grep --cached and git grep $REVISION should also only search
within the sparsity pattern
* git diff $REV1 $REV2, git diff $REV1, etc., should by default only
search within sparsity patterns
* git log should by default only show commits modifying files matching
the sparsity patterns
* for all of these, there should be some kind of
--ignore-sparsity-patterns flag to allow searching outside the
sparsity pattern
* other commands (archive, bisect, clean?, gitk, shortlog, blame,
fsck?, etc.) likely need to pay attention to sparsity patterns as
well, but there are some special cases:

* merge, cherry-pick, and rebase (anything touching the merge
machinery) will need to expand the size of the non-sparse worktree if
there are files outside the sparsity patterns with conflicts.  (Though
merge should do a better job of not expanding the non-sparse worktree
when files can cleanly be resolved.)
* ls-files has a -t option which can be used to differentiate which
entries in the index are skip-worktree (S) and which are not.  As
such, use of that flag should probably imply
--ignore-sparsity-patterns
* fast-export and format-patch are not about viewing history but about
exporting it, and limiting to sparsity patterns would result in the
creation of an incompatible history.  As such, they should error out
without a --ignore-sparsity-patterns when invoked from a repository
that has a sparse checkout.
* We may want to augment status with additional information to remind
users they are in a sparse checkout
* New worktrees, by default, should copy the sparsity-patterns of the
worktree they were created from (much like a new shell inherits the
current working directory of it's parent process)

I should note here that Stolee wasn't so sure about having 'log' only
show commits that touched files within the sparse patterns, so we may
need some kind of config setting and have a good usability story for
what each of the settings means and usecases in order to guide how to
handle weird cases better.

Also, as if that weren't enough, there are two more challenges too:
1) As pointed out by Dscho in the contributor summit, we want
intersection of pathspecs specified by the user and those in the
sparsity patterns; e.g. if the user says `git diff $REV -- '*.c' `, we
want to show them a diff against $REV of all .c files that are within
their sparsity patterns.
2) We have two different kinds of path patterns, one for .gitignore
and sparse-checkout, and the other for command-line pathspecs.  See
https://public-inbox.org/git/xmqq4l1qpiaw.fsf@gitster-ct.c.googlers.com/.
These differences might make the implementation more difficult, and
making the two types of path patterns have more overlap might be a
necessary first step.

However, the "work with a miniature repo" probably makes the VFS for
Git and partial clone stuff easier -- we don't have to worry about the
incessant need to download more blobs after the initial partial clone
because git commands by default would avoid requesting them.  It also
would work quite nicely with a partial index -- we could have a
directory entry in the index and marked as skip-worktree and avoid
having all the paths under it show up in the index.  That would
accelerate many operations within git.



I'd love to work on this, but I've got plenty of other things on my
plate at the moment, so I probably won't get time for it at least
until the middle of next year.  But I thought I'd send out what I view
as the bigger picture.  Also, this is very much still idea stage; the
contributor summit refined some of the ideas and there may be more
refinement as more people in the list chime in.


Hope that helps,
Elijah

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-01 13:06 git-grep in sparse checkout Matheus Tavares Bernardino
  2019-10-01 13:30 ` Bert Wesarg
@ 2019-10-01 18:29 ` Derrick Stolee
  2019-10-02  0:06   ` Matheus Tavares Bernardino
  2019-10-02  6:18   ` Junio C Hamano
  1 sibling, 2 replies; 9+ messages in thread
From: Derrick Stolee @ 2019-10-01 18:29 UTC (permalink / raw)
  To: Matheus Tavares Bernardino, git
  Cc: Jeff King, Derrick Stolee, Elijah Newren, Taylor Blau

On 10/1/2019 9:06 AM, Matheus Tavares Bernardino wrote:
> Hi,
> 
> During Git Summit it was mentioned that git-grep searches outside
> sparsity pattern which is not aligned with user expectation. I took a
> quick look at it and it seems the reason is
> builtin/grep.c:grep_cache() (which also greps worktree) will grep the
> object store when a given index entry has the CE_SKIP_WORKTREE bit
> turned on.
> 
> From what I understand, this bit is used exactly for sparse checkouts
> (as described in Documentation/technical/index-format.txt[1]). But
> should we perhaps ignore it in git-grep to have the expected behavior?
> I'll be happy to send the patch if so, but I wanted to check with you
> first.

Is that the expected behavior? In a sparse-checkout, wouldn't you _want_
Git to report things outside the cone? You can already use external tools
to search for things in the sparse cone: they are on disk. You need "git
grep" for the objects reachable from the current tree but not already
on disk.

I respect the goal to minimize the work "git grep" is doing, especially
in a sparse-checkout + partial-clone world, where we wouldn't expect to
have the blobs locally and this search would cause many blob downloads.
I just want to truly examine if this is the right behavior.

At minimum, I would expect a new option to have "git grep" go back to
the old behavior, so users who really want a tree-wide search can have
one.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-01 18:29 ` Derrick Stolee
@ 2019-10-02  0:06   ` Matheus Tavares Bernardino
  2019-10-02  6:18   ` Junio C Hamano
  1 sibling, 0 replies; 9+ messages in thread
From: Matheus Tavares Bernardino @ 2019-10-02  0:06 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jeff King, Derrick Stolee, Elijah Newren, Taylor Blau

On Tue, Oct 1, 2019 at 3:29 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 10/1/2019 9:06 AM, Matheus Tavares Bernardino wrote:
> > Hi,
> >
> > During Git Summit it was mentioned that git-grep searches outside
> > sparsity pattern which is not aligned with user expectation. I took a
> > quick look at it and it seems the reason is
> > builtin/grep.c:grep_cache() (which also greps worktree) will grep the
> > object store when a given index entry has the CE_SKIP_WORKTREE bit
> > turned on.
> >
> > From what I understand, this bit is used exactly for sparse checkouts
> > (as described in Documentation/technical/index-format.txt[1]). But
> > should we perhaps ignore it in git-grep to have the expected behavior?
> > I'll be happy to send the patch if so, but I wanted to check with you
> > first.
>
> Is that the expected behavior? In a sparse-checkout, wouldn't you _want_
> Git to report things outside the cone? You can already use external tools
> to search for things in the sparse cone: they are on disk. You need "git
> grep" for the objects reachable from the current tree but not already
> on disk.

Hmm, we can use external tools to search in the sparse cone, but even
in this circumstance, I think we need git-grep for some usecases.
git-grep on disk can exclude files not being tracked and access other
git functionalities such as --textconv to read .gitattributes and
convert the files before grepping. So maybe people would want to grep
just the cone but also have these other options that would be
unavailable through external tools?

(Users could already simulate such behavior giving the sparse clone
patterns as pathspecs to git-grep, but that can get complicated for
more complex sparse patterns.)

> I respect the goal to minimize the work "git grep" is doing, especially
> in a sparse-checkout + partial-clone world, where we wouldn't expect to
> have the blobs locally and this search would cause many blob downloads.
> I just want to truly examine if this is the right behavior.
>
> At minimum, I would expect a new option to have "git grep" go back to
> the old behavior, so users who really want a tree-wide search can have
> one.

Yes, I totally agree that this behavior change should come with such
an option. Maybe --ignore-sparsity-patterns, as Elijah suggested.

> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-01 18:29 ` Derrick Stolee
  2019-10-02  0:06   ` Matheus Tavares Bernardino
@ 2019-10-02  6:18   ` Junio C Hamano
  2019-10-02 12:09     ` Derrick Stolee
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2019-10-02  6:18 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares Bernardino, git, Jeff King, Derrick Stolee,
	Elijah Newren, Taylor Blau

Derrick Stolee <stolee@gmail.com> writes:

> Is that the expected behavior? In a sparse-checkout, wouldn't you _want_
> Git to report things outside the cone?

That should be optional, I would think.  When you declare "by
default, this the subset of the project I am interested in", we
should honor it, I would think.

> At minimum, I would expect a new option to have "git grep" go back to
> the old behavior, so users who really want a tree-wide search can have
> one.

Yeah, a bugfix to honor SKIP_WORKTREE bit, followed by a new feature
to ignore it, would be pretty sensible way to go.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-01 16:12   ` Elijah Newren
@ 2019-10-02  6:33     ` Junio C Hamano
  2019-10-02 16:46       ` Elijah Newren
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2019-10-02  6:33 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Bert Wesarg, Matheus Tavares Bernardino, git, Jeff King,
	Derrick Stolee, Taylor Blau

Elijah Newren <newren@gmail.com> writes:

> * other commands (archive, bisect, clean?, gitk, shortlog, blame,
> fsck?, etc.) likely need to pay attention to sparsity patterns as
> well, but there are some special cases:

"git archive" falls into the same class as fast-(im|ex)port; it
should ignore the sparse cone by default.  I suspect you threw
"fsck" as a joke, but I do not think it should pay attention to the
sparse cone, either (besides, most of the time in fsck the objects
subject to checking do not know all the paths that reach them).

> * merge, cherry-pick, and rebase (anything touching the merge
> machinery) will need to expand the size of the non-sparse worktree if
> there are files outside the sparsity patterns with conflicts.  (Though
> merge should do a better job of not expanding the non-sparse worktree
> when files can cleanly be resolved.)

I think the important point is what is done to the result of
operation.  Result of these operations that create new commits are
meant to be consumed by other people, who may not share your
definition of sparse cone.  And such a command (i.e. those whose
results are consumed by others who may have different sparse cone)
must be full-tree by default.

> * fast-export and format-patch are not about viewing history but about
> exporting it, and limiting to sparsity patterns would result in the
> creation of an incompatible history.

I agree with the conclusion; see above.

> * New worktrees, by default, should copy the sparsity-patterns of the
> worktree they were created from (much like a new shell inherits the
> current working directory of it's parent process)

Sorry, but I do not share this view at all.

In my mental model, "worktree new" is to attach a brand-new worktree
to a bare repository that underlies the existing worktree I happen
to be in, and that existing worktree that I happen to type "worktree
new" in is no more or no less special than other worktrees.

The above isn't to say that I'd veto your "a new worktree inherits
traits from an existing worktree that 'git worktree add' was invoked
in" idea.  I am just saying that I have a problem with that mode of
operation and mental model being the default.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-02  6:18   ` Junio C Hamano
@ 2019-10-02 12:09     ` Derrick Stolee
  0 siblings, 0 replies; 9+ messages in thread
From: Derrick Stolee @ 2019-10-02 12:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares Bernardino, git, Jeff King, Derrick Stolee,
	Elijah Newren, Taylor Blau

On 10/2/2019 2:18 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> Is that the expected behavior? In a sparse-checkout, wouldn't you _want_
>> Git to report things outside the cone?
> 
> That should be optional, I would think.  When you declare "by
> default, this the subset of the project I am interested in", we
> should honor it, I would think.
> 
>> At minimum, I would expect a new option to have "git grep" go back to
>> the old behavior, so users who really want a tree-wide search can have
>> one.
> 
> Yeah, a bugfix to honor SKIP_WORKTREE bit, followed by a new feature
> to ignore it, would be pretty sensible way to go.

Thanks for everyone's responses here. I am satisfied with this direction.

-Stolee


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: git-grep in sparse checkout
  2019-10-02  6:33     ` Junio C Hamano
@ 2019-10-02 16:46       ` Elijah Newren
  0 siblings, 0 replies; 9+ messages in thread
From: Elijah Newren @ 2019-10-02 16:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Bert Wesarg, Matheus Tavares Bernardino, git, Jeff King,
	Derrick Stolee, Taylor Blau

On Tue, Oct 1, 2019 at 11:33 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > * other commands (archive, bisect, clean?, gitk, shortlog, blame,
> > fsck?, etc.) likely need to pay attention to sparsity patterns as
> > well, but there are some special cases:
>
> "git archive" falls into the same class as fast-(im|ex)port; it
> should ignore the sparse cone by default.  I suspect you threw
> "fsck" as a joke, but I do not think it should pay attention to the
> sparse cone, either (besides, most of the time in fsck the objects
> subject to checking do not know all the paths that reach them).

archive in the same category as fast-(im|ex)port makes sense.  I'm not
sure if "ignore the sparse cone" by default makes sense or if it
should be a case where we error out if --ignore-sparsity-patterns
isn't specified, especially if history is also sparse.

In terms of fsck, I agree that if history is dense and the worktree is
sparse that you want to walk all history.  I was thinking further
along the lines when partial clones and sparse checkouts are combined
so that history is also sparse.  In cases where a partial clone is in
use, rather than download everything in order to walk it, wouldn't it
make more sense to have fsck walk over the bits that are already
downloaded?  I don't really know how that'd all work, but it seems
that if fsck walked over all history it'd be treated as a
useless/dangerous command by those who are doing partial clones
because the repo is just too big.

> > * merge, cherry-pick, and rebase (anything touching the merge
> > machinery) will need to expand the size of the non-sparse worktree if
> > there are files outside the sparsity patterns with conflicts.  (Though
> > merge should do a better job of not expanding the non-sparse worktree
> > when files can cleanly be resolved.)
>
> I think the important point is what is done to the result of
> operation.  Result of these operations that create new commits are
> meant to be consumed by other people, who may not share your
> definition of sparse cone.  And such a command (i.e. those whose
> results are consumed by others who may have different sparse cone)
> must be full-tree by default.
>
> > * fast-export and format-patch are not about viewing history but about
> > exporting it, and limiting to sparsity patterns would result in the
> > creation of an incompatible history.
>
> I agree with the conclusion; see above.
>
> > * New worktrees, by default, should copy the sparsity-patterns of the
> > worktree they were created from (much like a new shell inherits the
> > current working directory of it's parent process)
>
> Sorry, but I do not share this view at all.
>
> In my mental model, "worktree new" is to attach a brand-new worktree
> to a bare repository that underlies the existing worktree I happen
> to be in, and that existing worktree that I happen to type "worktree
> new" in is no more or no less special than other worktrees.
>
> The above isn't to say that I'd veto your "a new worktree inherits
> traits from an existing worktree that 'git worktree add' was invoked
> in" idea.  I am just saying that I have a problem with that mode of
> operation and mental model being the default.

If worktrees are the only area we disagree on, then I'll happily take
the stuff we agree on and can overlook this piece.

But, perhaps some further explaining on worktrees might help us reach
some middle ground. If worktrees are dense by default and folks have
not only sparse checkouts but sparse history, then creating a new
worktree would suddenly mandate downloading a lot more of history --
which could be prohibitively expensive, forcing people to instead have
N clones without any shared history.  That may be fine (I tend to not
be a heavy worktree user, I just support some users who are), but is
it the route we want to push people with big repos towards?


Thanks for the feedback on the ideas,
Elijah

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-10-02 16:46 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-01 13:06 git-grep in sparse checkout Matheus Tavares Bernardino
2019-10-01 13:30 ` Bert Wesarg
2019-10-01 16:12   ` Elijah Newren
2019-10-02  6:33     ` Junio C Hamano
2019-10-02 16:46       ` Elijah Newren
2019-10-01 18:29 ` Derrick Stolee
2019-10-02  0:06   ` Matheus Tavares Bernardino
2019-10-02  6:18   ` Junio C Hamano
2019-10-02 12:09     ` Derrick Stolee

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).