git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] grep: Add option --max-line-len
@ 2017-11-23 15:41 Marc-Antoine Ruel
  2017-11-23 19:24 ` Eric Sunshine
  2017-11-24  1:44 ` Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Marc-Antoine Ruel @ 2017-11-23 15:41 UTC (permalink / raw)
  To: git; +Cc: Marc-Antoine Ruel

This tells git grep to skip files longer than a specified length,
which is often the result of generators and not actual source files.

Signed-off-by: Marc-Antoine Ruel <maruel@chromium.org>
---
 Documentation/git-grep.txt | 5 +++++
 builtin/grep.c             | 2 ++
 grep.c                     | 4 ++++
 grep.h                     | 1 +
 t/t7810-grep.sh            | 5 +++++
 5 files changed, 17 insertions(+)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 18b494731..75081defb 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 --------
 [verse]
 'git grep' [-a | --text] [-I] [--textconv] [-i | --ignore-case] [-w | --word-regexp]
+	   [-M | --max-line-len <num>]
 	   [-v | --invert-match] [-h|-H] [--full-name]
 	   [-E | --extended-regexp] [-G | --basic-regexp]
 	   [-P | --perl-regexp]
@@ -127,6 +128,10 @@ OPTIONS
 	beginning of a line, or preceded by a non-word character; end at
 	the end of a line or followed by a non-word character).
 
+-M<num>::
+--max-line-len<num>::
+	Match the pattern only for line shorter or equal to this length.
+
 -v::
 --invert-match::
 	Select non-matching lines.
diff --git a/builtin/grep.c b/builtin/grep.c
index 5a6cfe6b4..cc5c70be5 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -796,6 +796,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			N_("case insensitive matching")),
 		OPT_BOOL('w', "word-regexp", &opt.word_regexp,
 			N_("match patterns only at word boundaries")),
+		OPT_INTEGER('M', "max-line-len", &opt.max_line_length,
+			N_("ignore lines longer than <n>")),
 		OPT_SET_INT('a', "text", &opt.binary,
 			N_("process binary files as text"), GREP_BINARY_TEXT),
 		OPT_SET_INT('I', NULL, &opt.binary,
diff --git a/grep.c b/grep.c
index d0b9b6cdf..881078b82 100644
--- a/grep.c
+++ b/grep.c
@@ -36,6 +36,7 @@ void init_grep_defaults(void)
 	opt->relative = 1;
 	opt->pathname = 1;
 	opt->max_depth = -1;
+	opt->max_line_length = -1;
 	opt->pattern_type_option = GREP_PATTERN_TYPE_UNSPECIFIED;
 	color_set(opt->color_context, "");
 	color_set(opt->color_filename, "");
@@ -151,6 +152,7 @@ void grep_init(struct grep_opt *opt, const char *prefix)
 	opt->pattern_type_option = def->pattern_type_option;
 	opt->linenum = def->linenum;
 	opt->max_depth = def->max_depth;
+	opt->max_line_length = def->max_line_length;
 	opt->pathname = def->pathname;
 	opt->relative = def->relative;
 	opt->output = def->output;
@@ -1273,6 +1275,8 @@ static int match_line(struct grep_opt *opt, char *bol, char *eol,
 	struct grep_pat *p;
 	regmatch_t match;
 
+	if (opt->max_line_length > 0 && eol-bol > opt->max_line_length)
+		return 0;
 	if (opt->extended)
 		return match_expr(opt, bol, eol, ctx, collect_hits);
 
diff --git a/grep.h b/grep.h
index 399381c90..0e76c0a19 100644
--- a/grep.h
+++ b/grep.h
@@ -151,6 +151,7 @@ struct grep_opt {
 	int null_following_name;
 	int color;
 	int max_depth;
+	int max_line_length;
 	int funcname;
 	int funcbody;
 	int extended_regexp_option;
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 2a6679c2f..c514bd388 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -766,6 +766,11 @@ test_expect_success 'grep -W shows no trailing empty lines' '
 	test_cmp expected actual
 '
 
+test_expect_success 'grep skips long lines' '
+	git grep -M18 -W include >actual &&
+	test_cmp expected actual
+'
+
 cat >expected <<EOF
 hello.c=	printf("Hello world.\n");
 hello.c:	return 0;
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] grep: Add option --max-line-len
  2017-11-23 15:41 [PATCH] grep: Add option --max-line-len Marc-Antoine Ruel
@ 2017-11-23 19:24 ` Eric Sunshine
  2017-11-24  1:44 ` Junio C Hamano
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Sunshine @ 2017-11-23 19:24 UTC (permalink / raw)
  To: Marc-Antoine Ruel; +Cc: Git List

On Thu, Nov 23, 2017 at 10:41 AM, Marc-Antoine Ruel <maruel@chromium.org> wrote:
> This tells git grep to skip files longer than a specified length,

s/files/lines/

> which is often the result of generators and not actual source files.
>
> Signed-off-by: Marc-Antoine Ruel <maruel@chromium.org>
> ---
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> @@ -127,6 +128,10 @@ OPTIONS
> +-M<num>::
> +--max-line-len<num>::
> +       Match the pattern only for line shorter or equal to this length.

This documentation doesn't explain what it means by a line's length.
This implementation seems to take into consideration only the line's
byte count, however, given that displayed lines are normally
tab-expanded, people might intuitively expect that expansion to count
toward the length. A similar question arises for Unicode characters.

Should this option take tab-expansion and Unicode into account?
Regardless of the answer to that question, the documentation should
make it clear what "line length" means.

> diff --git a/builtin/grep.c b/builtin/grep.c
> @@ -796,6 +796,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                         N_("case insensitive matching")),
>                 OPT_BOOL('w', "word-regexp", &opt.word_regexp,
>                         N_("match patterns only at word boundaries")),
> +               OPT_INTEGER('M', "max-line-len", &opt.max_line_length,
> +                       N_("ignore lines longer than <n>")),

On this project, it is typical to have only the long form of an option
name when the option is first implemented so as not to squat on one of
the relatively limited number of short option names. Only after it is
seen that an option is popular, does it get a short name. Whether this
actually matters in this case, I don't know, but is something to take
into consideration.

Why the choice of 'M'? Is there precedence in other Git or Unix
commands for that name over, say, 'N' or
<choose-your-favorite-letter>? Is 'M' is for "max"? If so, what would
a short name for --max-depth be (if it ever got one)? (These are
somewhat rhetorical questions, but I'm interested in the answers if
you have any.)

> diff --git a/grep.c b/grep.c
> @@ -1273,6 +1275,8 @@ static int match_line(struct grep_opt *opt, char *bol, char *eol,
> +       if (opt->max_line_length > 0 && eol-bol > opt->max_line_length)
> +               return 0;

If the user specifies "-M0", should that error out or at least warn
the user that the value is non-sensical? What about -1, etc.? (These
are UX-related questions; the implementation obviously doesn't care
one way or the other.)

> diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
> @@ -766,6 +766,11 @@ test_expect_success 'grep -W shows no trailing empty lines' '
> +test_expect_success 'grep skips long lines' '
> +       git grep -M18 -W include >actual &&
> +       test_cmp expected actual
> +'

Meh, what is this actually testing? The output of "git grep -M18 -W
include" is no different that the output of "git grep -W include" in
the test just above this one. And, why is -W in this test at all? Its
presence confuses the reader into thinking that it is somehow
significant. I would have expected this test to look like this:

    cat >expected <<-\EOF
    hello.c:#include <stdio.h>
    EOF

    test_expect_success 'grep -M skips long lines' '
        git grep -M18 include >actual &&
        test_cmp expected actual
    '

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] grep: Add option --max-line-len
  2017-11-23 15:41 [PATCH] grep: Add option --max-line-len Marc-Antoine Ruel
  2017-11-23 19:24 ` Eric Sunshine
@ 2017-11-24  1:44 ` Junio C Hamano
       [not found]   ` <CAN+rsqmEWHhnQvktxsLJC2CkOQEmBL3b_xjRkEOHzV8W72zJew@mail.gmail.com>
  1 sibling, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2017-11-24  1:44 UTC (permalink / raw)
  To: Marc-Antoine Ruel; +Cc: git

Marc-Antoine Ruel <maruel@chromium.org> writes:

> This tells git grep to skip files longer than a specified length,
> which is often the result of generators and not actual source files.
>
> ...
> +-M<num>::
> +--max-line-len<num>::
> +	Match the pattern only for line shorter or equal to this length.
> +

All the excellent review comments from Eric I agree with.

With the name of the option and the above end-user facing
description, it is very clear that the only thing this feature does
is to declare that an overlong line does _not_ match when trying to
check against any pattern.

That is a much clearer definition and description than random new
features people propose here (and kicked back by reviewers, telling
them to make the specification clearer), and I'd commend you for that.

But it still leaves at least one thing unclear.  How should it
interact with "-v"?  If we consider an overlong line never matches,
would "git grep -v <pattern>" should include the line in its output?

Speaking of the output, it also makes me wonder if the feature
really wants to include an overlong line as a context line when
showing a near-by line that matches the pattern when -A/-B/-C/-W
options are in use. Even though it is clear that it does from the
above description, is it really the best thing the feature can do to
help the end users?

Which leads me to suspect that this "feature" might not be the ideal
you wanted to achive, but is an approximate substitution that you
found is "good enough" to simulate what the real thing you wanted to
do, especially when I go back and read the justfication in the
proposed log message that talks about "result of generators".

Isn't it a property of the entire file, not individual lines, if you
find it uninteresting to see reported by "git grep"?  I cannot shake
the suspicion that this feature happened to have ended up in this
shape, instead of "ignore a file with a line this long", only
because your starting point was to use "has overlong lines" as the
heuristic for "not interesting", and because "git grep" code is not
structured to first scan the entire file to decide if it is worth
working on it, and it is extra work to restructure the codeflow to
make it so (which you avoided).

If your real motivation was either

 (1) whether the file has or does not have the pattern for certain
     class of files are uninteresting; do not even run "grep"
     processing for them; or

 (2) hits or no-hits may be intereseting but output of overlong
     lines from certain class of files I do not wish to see;

then I can think of two alternatives.

For (1), can't we tell "result of generators" and other files with
pathspec?  If so, perhaps a negative pathspec can rescue.  e.g.

    git grep <pattern> -- '*.cc' ':!*-autogen.cc'

For (2), can't we model this after how users can tell "git diff"
that certain paths are not worth computing and showing textual
patches for, which is to Unset the 'diff' attribute?  When you have

    *-autogen.cc	-diff

in your .gitattributes, "git diff" would say "Binary files A and B
differ" instead of explaining line-by-line differences in the patch
form.  Perhaps we can also have a 'grep' attribute and squelch the
output if it is Unset?  

It is debatable but one could propose extending the use of existing
'diff' attribute to cover 'grep' too, with an argument that anything
not worth showing patch (i.e. 'diff' attribute is Unset) is not
worth showing grep hits from.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] grep: Add option --max-line-len
       [not found]   ` <CAN+rsqmEWHhnQvktxsLJC2CkOQEmBL3b_xjRkEOHzV8W72zJew@mail.gmail.com>
@ 2017-11-27  1:58     ` Marc-Antoine Ruel
  2017-11-27  2:08       ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Marc-Antoine Ruel @ 2017-11-27  1:58 UTC (permalink / raw)
  To: Junio C Hamano, Eric Sunshine; +Cc: git

[second try, now with text format]

Thanks a lot for the reviews. Replying to both.

If I send a follow up, I'll fix the commit description and the help
string, remove the shorthand -M, write a more sensible test.


2017-11-23 14:24 GMT-05:00 Eric Sunshine <sunshine@sunshineco.com>:
>> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
>> @@ -127,6 +128,10 @@ OPTIONS
>> +-M<num>::
>> +--max-line-len<num>::
>> +       Match the pattern only for line shorter or equal to this length.
>
> This documentation doesn't explain what it means by a line's length.
> This implementation seems to take into consideration only the line's
> byte count, however, given that displayed lines are normally
> tab-expanded, people might intuitively expect that expansion to count
> toward the length. A similar question arises for Unicode characters.
>
> Should this option take tab-expansion and Unicode into account?
> Regardless of the answer to that question, the documentation should
> make it clear what "line length" means.

Excellent question. I don't have an immediate answer.


>> diff --git a/grep.c b/grep.c
>> @@ -1273,6 +1275,8 @@ static int match_line(struct grep_opt *opt, char *bol, char *eol,
>> +       if (opt->max_line_length > 0 && eol-bol > opt->max_line_length)
>> +               return 0;
>
> If the user specifies "-M0", should that error out or at least warn
> the user that the value is non-sensical? What about -1, etc.? (These
> are UX-related questions; the implementation obviously doesn't care
> one way or the other.)

Precedent with -A is to ignore the negative value. I don't have a
strong opinion.


2017-11-23 20:44 GMT-05:00 Junio C Hamano <gitster@pobox.com>:
>
> Marc-Antoine Ruel <maruel@chromium.org> writes:
>
> > This tells git grep to skip files longer than a specified length,
> > which is often the result of generators and not actual source files.
> >
> > ...
> > +-M<num>::
> > +--max-line-len<num>::
> > +     Match the pattern only for line shorter or equal to this length.
> > +
>
> All the excellent review comments from Eric I agree with.
>
> With the name of the option and the above end-user facing
> description, it is very clear that the only thing this feature does
> is to declare that an overlong line does _not_ match when trying to
> check against any pattern.
>
> That is a much clearer definition and description than random new
> features people propose here (and kicked back by reviewers, telling
> them to make the specification clearer), and I'd commend you for that.
>
> But it still leaves at least one thing unclear.  How should it
> interact with "-v"?  If we consider an overlong line never matches,
> would "git grep -v <pattern>" should include the line in its output?

Ah! No idea. :/

> Speaking of the output, it also makes me wonder if the feature
> really wants to include an overlong line as a context line when
> showing a near-by line that matches the pattern when -A/-B/-C/-W
> options are in use. Even though it is clear that it does from the
> above description, is it really the best thing the feature can do to
> help the end users?
>
> Which leads me to suspect that this "feature" might not be the ideal
> you wanted to achive, but is an approximate substitution that you
> found is "good enough" to simulate what the real thing you wanted to
> do, especially when I go back and read the justfication in the
> proposed log message that talks about "result of generators".
>
> Isn't it a property of the entire file, not individual lines, if you
> find it uninteresting to see reported by "git grep"?  I cannot shake
> the suspicion that this feature happened to have ended up in this
> shape, instead of "ignore a file with a line this long", only
> because your starting point was to use "has overlong lines" as the
> heuristic for "not interesting", and because "git grep" code is not
> structured to first scan the entire file to decide if it is worth
> working on it, and it is extra work to restructure the codeflow to
> make it so (which you avoided).
>
> If your real motivation was either
>
>  (1) whether the file has or does not have the pattern for certain
>      class of files are uninteresting; do not even run "grep"
>      processing for them; or
>
>  (2) hits or no-hits may be intereseting but output of overlong
>      lines from certain class of files I do not wish to see;
>
> then I can think of two alternatives.
>
> For (1), can't we tell "result of generators" and other files with
> pathspec?  If so, perhaps a negative pathspec can rescue.  e.g.
>
>     git grep <pattern> -- '*.cc' ':!*-autogen.cc'
>
> For (2), can't we model this after how users can tell "git diff"
> that certain paths are not worth computing and showing textual
> patches for, which is to Unset the 'diff' attribute?  When you have
>
>     *-autogen.cc        -diff
>
> in your .gitattributes, "git diff" would say "Binary files A and B
> differ" instead of explaining line-by-line differences in the patch
> form.  Perhaps we can also have a 'grep' attribute and squelch the
> output if it is Unset?
>
> It is debatable but one could propose extending the use of existing
> 'diff' attribute to cover 'grep' too, with an argument that anything
> not worth showing patch (i.e. 'diff' attribute is Unset) is not
> worth showing grep hits from.

Thanks for the thoughtful analysis. My main motivation was (1), thus
filtering with a pathspec is much better than trying to work around
the issue. The issues raised in the review are significant enough that
committing this patch could cause significant issues; I don't know how
to resolve handling with -v and how to handle tabs.

After further thinking, what I'd like is a smarter version of the
git-gs shortcut wrapper that limits the search space on well known
extensions but I'd like it to also limit itself to "source-like"
files, similar in some ways to the -I flag. So in some ways this could
be better served as a git config but I'm not even sure what kind of
heuristic would be generic enough to be valuable to a large number of
users.

As such I'll drop the patch as I don't see a clear path forward with
the current one.

Thanks,

M-A

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] grep: Add option --max-line-len
  2017-11-27  1:58     ` Marc-Antoine Ruel
@ 2017-11-27  2:08       ` Junio C Hamano
  0 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2017-11-27  2:08 UTC (permalink / raw)
  To: Marc-Antoine Ruel; +Cc: Eric Sunshine, git

Marc-Antoine Ruel <maruel@chromium.org> writes:

> [second try, now with text format]
>
> Thanks a lot for the reviews. Replying to both.
>
> If I send a follow up, I'll fix the commit description and the help
> string, remove the shorthand -M, write a more sensible test.
>
> ...
>
> Thanks for the thoughtful analysis. My main motivation was (1), thus
> filtering with a pathspec is much better than trying to work around
> the issue. The issues raised in the review are significant enough that
> committing this patch could cause significant issues; I don't know how
> to resolve handling with -v and how to handle tabs.

OK.  Thanks for thinking it through.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-11-27  2:08 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-23 15:41 [PATCH] grep: Add option --max-line-len Marc-Antoine Ruel
2017-11-23 19:24 ` Eric Sunshine
2017-11-24  1:44 ` Junio C Hamano
     [not found]   ` <CAN+rsqmEWHhnQvktxsLJC2CkOQEmBL3b_xjRkEOHzV8W72zJew@mail.gmail.com>
2017-11-27  1:58     ` Marc-Antoine Ruel
2017-11-27  2:08       ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).