git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Hamza Mahfooz <someguy@effective-light.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>,
	Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCH v10 2/2] pretty: colorize pattern matches in commit messages
Date: Tue, 05 Oct 2021 18:29:21 +0200	[thread overview]
Message-ID: <87v92bju64.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20210929115716.10364-2-someguy@effective-light.com>


On Wed, Sep 29 2021, Hamza Mahfooz wrote:

> The "git log" command limits its output to the commits that contain strings
> matched by a pattern when the "--grep=<pattern>" option is used, but unlike
> output from "git grep -e <pattern>", the matches are not highlighted,
> making them harder to spot.
>
> Teach the pretty-printer code to highlight matches from the
> "--grep=<pattern>", "--author=<pattern>" and "--committer=<pattern>"
> options (to view the last one, you may have to ask for --pretty=fuller).
>
> Also, it must be noted that we are effectively greping the content twice,
> however it only slows down "git log --author=^H" on this repository by
> around 1-2% (compared to v2.33.0), so it should be a small enough slow
> down to justify the addition of the feature.

I tested the performance of this, that part looks good. Just for future
reference, is the running it twice just because it's a hassle to pass up
the match data or do the equivalent of a /g match on the first pass, and
doing this purely in the output emitter is easier?

There's a bug in how this highlighter handles UTF-8. I.e. with "grep" in
general we don't turn on UTF-8 unless the pattern itself is
UTF-8. Leading to this edge case:
    
    $ git grep Ævar -- CODE_OF_CONDUCT.md
    CODE_OF_CONDUCT.md:  - Ævar Arnfjörð Bjarmason <avarab@gmail.com>
    $ git grep .var -- CODE_OF_CONDUCT.md
    CODE_OF_CONDUCT.md:  - <C3><86>var Arnfjörð Bjarmason <avarab@gmail.com>

See 95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks,
2021-01-24) for the rabbit hole of why we can't just turn that on. We
need to be able to grep arbitrary binary crap by default.

But for "log" we can be much stricter with encodings, and we've got both
i18n.commitEncoding and i18n.logOutputEncoding in play. This should pay
attention to those.

In theory any such bugs pre-date this series, but in practice they
don't. We'd match in non-UTF8 PCRE mode before, which has some edge
cases, but since we're not highlighting anything not knowing character
boundaries usually doesn't matter.

But with this we've got the same edge case in the log options now:
    
    $ ~/g/git/git log --color --author='.var.*ö.*Bjar' -1 origin/master  | grep ^Author
    Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
    $ ~/g/git/git log --color --author='.var.*Bjar' -1 origin/master  | grep ^Author
    grep: (standard input): binary file matches

I think just something like the below should fix this all up. It passes
all tests, although more would need to be added showing how this new
coloring mode handles utf8/non-utf8, and in combination with
e.g. i18n.commitEncoding=latin1 (which will emit "binary", at least
according to my otherwise UTF-8 GNU grep):

If we skip that "!has_non_ascii(p->pattern)' check one test fails, but
maybe that test is just stupid and we should die on that input
anyway. I.e. if we've said we're emitting UTF-8 shouldn't we be
validating/converting that somewhere before we feed it to PCRE and
friends? Maybe there's a good reason not to (and as I type this I
realize performance might suck, although "is ascii?" is fairly cheap,
and you'd only need to do it on !ASCII data).

diff --git a/grep.c b/grep.c
index fe847a0111a..ff09fbdd6cf 100644
--- a/grep.c
+++ b/grep.c
@@ -382,8 +382,9 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
-	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+	if ((opt->utf8_all_the_things && !has_non_ascii(p->pattern) ) ||
+	    (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed))))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
diff --git a/grep.h b/grep.h
index 808ad76f0c5..c9ddd637d18 100644
--- a/grep.h
+++ b/grep.h
@@ -167,6 +167,7 @@ struct grep_opt {
 	int extended_regexp_option;
 	int pattern_type_option;
 	int ignore_locale;
+	int utf8_all_the_things;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
 	unsigned post_context;
diff --git a/revision.c b/revision.c
index 0dabb5a0bcf..0d751dceb7e 100644
--- a/revision.c
+++ b/revision.c
@@ -2874,6 +2874,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 				 &revs->grep_filter);
 	if (!is_encoding_utf8(get_log_output_encoding()))
 		revs->grep_filter.ignore_locale = 1;
+	else
+		revs->grep_filter.utf8_all_the_things = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
 	if (revs->reverse && revs->reflog_info)

      reply	other threads:[~2021-10-05 16:47 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-29 11:57 [PATCH v10 1/2] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-09-29 11:57 ` [PATCH v10 2/2] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-05 16:29   ` Ævar Arnfjörð Bjarmason [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87v92bju64.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=someguy@effective-light.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).