From: "René Scharfe" <l.s.r@web.de>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>, git@vger.kernel.org
Cc: plamen.totev@abv.bg
Subject: Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns
Date: Mon, 06 Jul 2015 22:10:33 +0200
Message-ID: <559AE0B9.2040704@web.de>
In-Reply-To: <1436186551-32544-1-git-send-email-pclouds@gmail.com>

On 06.07.2015 at 14:42, Nguyễn Thái Ngọc Duy wrote:
> Noticed-by: Plamen Totev <plamen.totev@abv.bg>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
> grep.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index b58c7c6..48db15a 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -378,7 +378,7 @@ static void free_pcre_regexp(struct grep_pat *p)
>  }
>  #endif /* !USE_LIBPCRE */
>
> -static int is_fixed(const char *s, size_t len)
> +static int is_fixed(const char *s, size_t len, int ignore_icase)
>  {
>  	size_t i;
>
> @@ -391,6 +391,13 @@ static int is_fixed(const char *s, size_t len)
>  	for (i = 0; i < len; i++) {
>  		if (is_regex_special(s[i]))
>  			return 0;
> +		/*
> +		 * The builtin substring search can only deal with case
> +		 * insensitivity in ascii range. If there is something outside
> +		 * of that range, fall back to regcomp.
> +		 */
> +		if (ignore_icase && (unsigned char)s[i] >= 128)
> +			return 0;
How about "isascii(s[i])"?
>  	}
>
>  	return 1;
> @@ -398,18 +405,19 @@ static int is_fixed(const char *s, size_t len)
>
>  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  {
> +	int ignore_icase = opt->regflags & REG_ICASE || p->ignore_case;
>  	int err;
>
>  	p->word_regexp = opt->word_regexp;
>  	p->ignore_case = opt->ignore_case;
The initialization of the new variable ignore_icase above reads
p->ignore_case before this line assigns it, which changes the meaning.
>
> -	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
> +	if (opt->fixed || is_fixed(p->pattern, p->patternlen, ignore_icase))
>  		p->fixed = 1;
>  	else
>  		p->fixed = 0;
>
>  	if (p->fixed) {
> -		if (opt->regflags & REG_ICASE || p->ignore_case)
> +		if (ignore_case)
ignore_icase instead? ignore_case is for the config variable
core.ignorecase. Tricky.
>  			p->kws = kwsalloc(tolower_trans_tbl);
>  		else
>  			p->kws = kwsalloc(NULL);
>
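
The two remarks above on ordering and naming can be illustrated with a
minimal sketch. It assumes cut-down stand-ins for the real structures:
struct opt_sketch, struct pat_sketch and compute_icase_sketch() are
hypothetical names, not git code. The idea is to copy opt->ignore_case
into the pattern first and only then derive the combined flag from it.

```c
#include <regex.h>

/* Simplified stand-ins; the real structs in grep.h have many more fields. */
struct opt_sketch {
	int regflags;    /* may contain REG_ICASE */
	int ignore_case; /* the per-grep option, not the core.ignorecase global */
};

struct pat_sketch {
	int ignore_case;
};

/*
 * Returns 1 if the pattern should be matched case-insensitively.
 * The field is assigned before it is read, so a stale value left in
 * p->ignore_case cannot leak into the result.
 */
static int compute_icase_sketch(struct pat_sketch *p,
				const struct opt_sketch *opt)
{
	p->ignore_case = opt->ignore_case;
	return (opt->regflags & REG_ICASE) || p->ignore_case;
}
```

Giving the local variable a name that differs clearly from the global
ignore_case would also make a typo like the one above harder to miss.
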
So the optimization before this patch was that if a string was searched
for without -F then it would be treated as a fixed string anyway unless
it contained regex special characters. Searching for fixed strings
using the kwset functions is faster than using regcomp and regexec,
which makes the exercise worthwhile.

Your patch disables the optimization for case-insensitive searches whose
pattern contains non-ASCII characters, because kwset handles case
transformations only for ASCII chars.
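
As a rough sketch of that decision, assuming a local stand-in for
is_regex_special() (the character set below is a guess, not git's
table) and a hypothetical helper is_ascii_byte() that plays the role of
the isascii() test suggested earlier:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for git's is_regex_special(); this character set is an assumption. */
static int is_regex_special_sketch(char c)
{
	return c != '\0' && strchr("$()*+.?[\\^{|", c) != NULL;
}

/* Same test as isascii(), written out to avoid depending on its declaration. */
static int is_ascii_byte(unsigned char c)
{
	return c < 0x80;
}

/*
 * Returns 1 if the pattern may be handed to the fixed-string matcher,
 * 0 if it has to go through regcomp().
 */
static int is_fixed_sketch(const char *s, size_t len, int icase)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (is_regex_special_sketch(s[i]))
			return 0;
		/* Case folding in the fixed-string matcher covers ASCII only. */
		if (icase && !is_ascii_byte((unsigned char)s[i]))
			return 0;
	}
	return 1;
}
```

With this sketch a UTF-8 pattern still counts as fixed for a
case-sensitive search and is rejected only when icase is set.
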

Another consequence of this limitation is that -Fi (explicit
case-insensitive fixed-string search) doesn't work properly with
non-ASCII chars either. How can we handle this one? Fall back to
regcomp by escaping all special characters? Or at least warn?
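
A minimal sketch of the escaping fallback, assuming POSIX extended
regex syntax. The names escape_literal_sketch() and
compile_literal_icase_sketch() are made up for illustration. Whether
REG_ICASE then folds non-ASCII characters correctly depends on the
regex implementation and the active locale, so this only shows the
mechanics.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of a NUL-terminated literal in which
 * every character that is special in a POSIX extended regex is
 * preceded by a backslash. The caller frees the result.
 */
static char *escape_literal_sketch(const char *s)
{
	static const char special[] = "\\^$.|?*+()[{";
	char *out = malloc(2 * strlen(s) + 1);
	char *p = out;

	if (!out)
		return NULL;
	for (; *s; s++) {
		if (strchr(special, *s))
			*p++ = '\\';
		*p++ = *s;
	}
	*p = '\0';
	return out;
}

/* Compile a literal for case-insensitive matching; returns regcomp()'s result. */
static int compile_literal_icase_sketch(regex_t *re, const char *literal)
{
	char *escaped = escape_literal_sketch(literal);
	int err;

	if (!escaped)
		return REG_ESPACE;
	err = regcomp(re, escaped, REG_EXTENDED | REG_ICASE | REG_NOSUB);
	free(escaped);
	return err;
}
```

A caller would pass the compiled regex_t to regexec() and release it
with regfree() when done.
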

Tests would be nice. :)

René