From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: git@vger.kernel.org
Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com,
	gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net,
	sandals@crustytoothpaste.net, szeder.dev@gmail.com,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [PATCH v2 2/9] grep: don't use PCRE2?_UTF8 with "log --encoding=<non-utf8>"
Date: Fri, 28 Jun 2019 01:39:05 +0200
Message-ID: <20190627233912.7117-3-avarab@gmail.com>
In-Reply-To: <20190626000329.32475-1-avarab@gmail.com>

Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8",
2016-06-25) that was missed due to a blind spot in our tests, as
discussed in the previous commit. I then blindly copied the same bug
in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when
adding the PCRE v2 code.

We should not tell PCRE that we're processing UTF-8 just because we're
dealing with non-ASCII. In the case of e.g. "log --encoding=<...>"
under is_utf8_locale() the haystack might be in ISO-8859-1, and the
needle might be in a non-UTF-8 encoding.
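
As an illustration only (not part of this patch), here is a minimal
sketch of why "non-ASCII" does not imply "UTF-8". The helper name
looks_like_utf8() is hypothetical and the check is simplified: it
validates lead and continuation bytes but does not reject every
overlong or surrogate form. The ISO-8859-1 byte 0xE9 ("é") is
non-ASCII, yet on its own it is not well-formed UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of git: returns 1 if the NUL-terminated
 * string passes a simplified UTF-8 well-formedness check, 0 otherwise.
 */
static int looks_like_utf8(const char *s)
{
	const unsigned char *p = (const unsigned char *)s;

	assert(s != NULL);
	while (*p) {
		int trail;

		if (*p < 0x80)
			trail = 0;	/* plain ASCII */
		else if (*p >= 0xC2 && *p <= 0xDF)
			trail = 1;	/* lead byte of a 2-byte sequence */
		else if (*p >= 0xE0 && *p <= 0xEF)
			trail = 2;	/* lead byte of a 3-byte sequence */
		else if (*p >= 0xF0 && *p <= 0xF4)
			trail = 3;	/* lead byte of a 4-byte sequence */
		else
			return 0;	/* stray continuation or invalid lead byte */
		p++;
		while (trail--) {
			if ((*p & 0xC0) != 0x80)
				return 0;	/* also catches a premature NUL */
			p++;
		}
	}
	return 1;
}
```

With this sketch, looks_like_utf8("\xc3\xa9") accepts the UTF-8
encoding of "é", while looks_like_utf8("\xe9") rejects the ISO-8859-1
one, which is the kind of input PCRE would be handed in UTF mode
under "log --encoding=ISO-8859-1".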

Maybe we should be more strict here and die earlier? Should we also be
converting the needle to the encoding in question, and failing if it's
not a string that's valid in that encoding? Maybe.

But for now matching this as non-UTF-8 at least has some hope of
producing sensible results, since we know that our default heuristic
of assuming the text to be matched is in the user locale encoding
doesn't hold when we've explicitly re-encoded it to a different
encoding.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c              | 8 ++++----
 grep.h              | 1 +
 revision.c          | 3 +++
 t/t4210-log-i18n.sh | 6 ++----
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..1de4ab49c0 100644
--- a/grep.c
+++ b/grep.c
@@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int options = PCRE_MULTILINE;
 
 	if (opt->ignore_case) {
-		if (has_non_ascii(p->pattern))
+		if (!opt->ignore_locale && has_non_ascii(p->pattern))
 			p->pcre1_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
-	if (is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE_UTF8;
 
 	p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
@@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	p->pcre2_compile_context = NULL;
 
 	if (opt->ignore_case) {
-		if (has_non_ascii(p->pattern)) {
+		if (!opt->ignore_locale && has_non_ascii(p->pattern)) {
 			character_tables = pcre2_maketables(NULL);
 			p->pcre2_compile_context = pcre2_compile_context_create(NULL);
 			pcre2_set_character_tables(p->pcre2_compile_context, character_tables);
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
 		options |= PCRE2_UTF;
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/grep.h b/grep.h
index 1875880f37..4bb8a79d93 100644
--- a/grep.h
+++ b/grep.h
@@ -173,6 +173,7 @@ struct grep_opt {
 	int funcbody;
 	int extended_regexp_option;
 	int pattern_type_option;
+	int ignore_locale;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
 	unsigned post_context;
diff --git a/revision.c b/revision.c
index 621feb9df7..a842fb158a 100644
--- a/revision.c
+++ b/revision.c
@@ -28,6 +28,7 @@
 #include "commit-graph.h"
 #include "prio-queue.h"
 #include "hashmap.h"
+#include "utf8.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
 				 &revs->grep_filter);
+	if (!is_encoding_utf8(get_log_output_encoding()))
+		revs->grep_filter.ignore_locale = 1;
 	compile_grep_patterns(&revs->grep_filter);
 
 	if (revs->reverse && revs->reflog_info)
diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh
index 86d22c1d4c..515bcb7ce1 100755
--- a/t/t4210-log-i18n.sh
+++ b/t/t4210-log-i18n.sh
@@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' '
 for engine in fixed basic extended perl
 do
 	prereq=
-	result=success
 	if test $engine = "perl"
 	then
-		result=failure
 		prereq="PCRE"
 	else
 		prereq=""
@@ -72,7 +70,7 @@ do
 	then
 	    force_regex=.*
 	fi
-	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
 		cat >expect <<-\EOF &&
 		latin1
 		utf8
@@ -86,7 +84,7 @@ do
 		test_must_be_empty actual
 	"
 
-	test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
+	test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
 		LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
 		test_must_be_empty actual
 	"
-- 
2.22.0.455.g172b71a6c5

