git@vger.kernel.org mailing list mirror (one of many)
From: Duy Nguyen <pclouds@gmail.com>
To: Plamen Totev <plamen.totev@abv.bg>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Git grep does not support multi-byte characters (like UTF-8)
Date: Tue, 7 Jul 2015 19:22:28 +0700
Message-ID: <CACsJy8A035JsR-z34g56QWLVkN0Tg0tOqmrT+Hp6ecsVBhm25g@mail.gmail.com>
In-Reply-To: <775251698.1328032.1436259534851.JavaMail.apache@nm31.abv.bg>

On Tue, Jul 7, 2015 at 3:58 PM, Plamen Totev <plamen.totev@abv.bg> wrote:
> Nguyen, thanks for the help and the patch. Also the escaping suggested by Scharfe seems like a good choice. But I dug some more into the problem and found something else. That's why I replied on the main thread and not on the patch. I hope you'll excuse me if this is bad practice.

So far this is very good reporting. I can't complain :)

> git grep -i -P also does not work because PCRE_UTF8 is not set and the pcre library does not treat the string as UTF-8.

We do prefer utf-8, but I don't know if we can assume utf-8 everywhere
yet. I guess it's ok in this case.

> pickaxe search also uses kwsearch so the case insensitive search with it does not work (e.g. git log -i -S). Maybe this is less of a problem here as one is expected to search for an exact string (hence knows the case)

Will fix (I'm close to being done with git-grep, not counting the pcre
bug you just found)

> There is an interesting corner case. is_fixed treats all patterns containing NULs as fixed. So what if the string contains non-ASCII symbols as well as NULs and the search is case insensitive :) I have to admit that my knowledge of UTF-8 is not enough to answer whether this could occur during normal usage, for example if the second byte of a multi-byte symbol is NUL. I would guess that's not the case, as it would break a lot of programs that depend on NUL-delimited strings, but it would be good if somebody could confirm.

For utf-8, if NUL occurs in a byte stream, it must be ASCII NUL, not
part of any multibyte character. Utf-8 is really well tuned for C
strings.
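
The NUL property can be checked directly from the UTF-8 bit layout. The following is a minimal sketch, not code from git; `utf8_encode` and `utf8_encoding_has_nul` are hypothetical helper names chosen for illustration. Every byte of a multibyte sequence has its high bit set (lead bytes are 11xxxxxx, continuation bytes are 10xxxxxx), so a zero byte can only come from code point 0.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point.
 * Hypothetical helper for illustration only.
 */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;                          /* 0xxxxxxx */
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));          /* 110xxxxx */
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;                                    /* surrogates are not encodable */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx */
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));         /* 11110xxx */
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* Returns 1 if the UTF-8 encoding of cp contains a zero byte, else 0. */
int utf8_encoding_has_nul(uint32_t cp)
{
	unsigned char buf[4];
	size_t i, n = utf8_encode(cp, buf);

	for (i = 0; i < n; i++)
		if (buf[i] == 0)
			return 1;
	return 0;
}
```

With these helpers, `utf8_encoding_has_nul(0)` is 1 and the result is 0 for any other valid code point. U+0100, for instance, contains a zero byte in UTF-16 but not in UTF-8.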

> GNU grep indeed uses escaped regular expressions when the string uses a multi-byte encoding and the search is case insensitive. If the encoding is UTF-8 then this strategy could be used in git too, especially since git already has support and helper functions to work with UTF-8. As for the other multi-byte encodings, I think things would become more complicated. As far as I know in UTF-8 the '{' character for example is two bytes not one. Maybe support could be added only for UTF-8, and a warning issued if the string is not UTF-8.

In the worst case we could reuse the trick we do elsewhere in git:
convert to utf-8 with iconv, do whatever we need to (escaping...) then
convert back before feeding it to regcomp and friends.
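
The escaping step mentioned here can be sketched as follows. This is not git's implementation: `ere_quote` and `literal_icase_match` are hypothetical names, the use of POSIX extended syntax (REG_EXTENDED) is an assumption made to keep the metacharacter set simple, and the iconv round trip is left out entirely.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of s with every POSIX ERE
 * metacharacter backslash-escaped, or NULL if allocation fails.
 * The caller frees the result. Hypothetical helper.
 */
char *ere_quote(const char *s)
{
	static const char special[] = ".[\\()*+?{|^$";
	char *out = malloc(2 * strlen(s) + 1);  /* worst case: every byte escaped */
	char *p = out;

	if (!out)
		return NULL;
	for (; *s; s++) {
		if (strchr(special, *s))
			*p++ = '\\';
		*p++ = *s;
	}
	*p = '\0';
	return out;
}

/*
 * Case-insensitive search for the literal string needle in haystack,
 * done with regcomp()/regexec() instead of a byte-wise keyword search.
 * Returns 1 on match, 0 on no match, -1 on error.
 */
int literal_icase_match(const char *needle, const char *haystack)
{
	regex_t re;
	char *quoted = ere_quote(needle);
	int ret;

	if (!quoted)
		return -1;
	if (regcomp(&re, quoted, REG_EXTENDED | REG_ICASE | REG_NOSUB)) {
		free(quoted);
		return -1;
	}
	ret = regexec(&re, haystack, 0, NULL, 0) == 0;
	regfree(&re);
	free(quoted);
	return ret;
}
```

Whether REG_ICASE folds non-ASCII letters depends on the regex implementation and on the process locale, so a caller would presumably need setlocale(LC_CTYPE, "") first; that part is untested in this sketch.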

> So maybe the following makes sense when a grep search is performed:
> * check if a multi-byte encoding is used. If it is, the search is case insensitive and the encoding is not UTF-8, give a warning;
> * if pcre is used and the string is UTF-8 encoded set the PCRE_UTF8 flag;
> * if the search is case insensitive, the string is fixed and the encoding used is UTF-8, use regcomp instead of kwsearch and escape any regex special characters in the pattern;
>
> And the question of the behavior of pickaxe search remains open. Using kwset does not work with case insensitive non-ASCII searches. Instead of fixing grep.c maybe it's better if a new function is introduced that performs keyword searches so it could be used by grep, diffcore-pickaxe and any other code in the future that may require such functionality. Or maybe diffcore-pickaxe should use grep instead of using kwset/regcomp directly

That function would be called "grep". More refactoring would be needed.
"git grep regcomp" reveals quite a few places. Some of them would
benefit from kws if we provide this new function you mentioned.

> Regards,
> Plamen Totev
>
>
>
>>-------- Original message --------
>>From: Duy Nguyen pclouds@gmail.com
>>Subject: Re: Git grep does not support multi-byte characters (like UTF-8)
>>To: Plamen Totev <plamen.totev@abv.bg>
>>Sent: 06.07.2015 15:23
>
>> I think we over-optimized a bit. If your system provides regex
>> with locale support (e.g. Linux) and you don't explicitly use the fallback
>> regex implementation, it should work. I suppose your test patterns
>> look "fixed" (i.e. no regex special characters)? Can you try just adding
>> "." and see if case insensitive matching works?
>
> This is indeed the problem. When I added the "." the matching works just fine.
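
The effect of adding "." can be illustrated with a small dispatch sketch. It is not taken from grep.c: the function names are hypothetical and the metacharacter set is an assumption. A pattern with no regex special characters looks "fixed" and goes to the byte-wise keyword search, which does not fold non-ASCII case; a pattern containing "." no longer looks fixed and reaches the locale-aware regex engine. The last function adds the check discussed in this thread, sending case-insensitive non-ASCII literals to the regex engine as well.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if any byte in s[0..len) is outside the ASCII range. */
int has_non_ascii(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)s[i] >= 0x80)
			return 1;
	return 0;
}

/*
 * Returns 1 if s[0..len) contains no regex metacharacter, i.e. the
 * pattern looks "fixed". NUL bytes are skipped, so they never make a
 * pattern non-fixed.
 */
int looks_fixed(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (s[i] && strchr(".[\\()*+?{|^$", s[i]))
			return 0;
	return 1;
}

/*
 * Returns 1 if the fast byte-wise keyword matcher may be used,
 * 0 if the pattern should go to the regex engine instead.
 */
int can_use_keyword_search(const char *s, size_t len, int ignore_case)
{
	if (!looks_fixed(s, len))
		return 0;
	if (ignore_case && has_non_ascii(s, len))
		return 0;
	return 1;
}
```

For example, `can_use_keyword_search("hello", 5, 1)` returns 1, while a two-byte UTF-8 letter with ignore_case set returns 0.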



-- 
Duy

Thread overview: 53+ messages
2015-07-06 11:28 Git grep does not support multi-byte characters (like UTF-8) Plamen Totev
2015-07-06 12:23 ` Duy Nguyen
2015-07-07  8:58   ` Plamen Totev
2015-07-07 12:22     ` Duy Nguyen [this message]
2015-07-07 16:07     ` Junio C Hamano
2015-07-07 18:08       ` Plamen Totev
2015-07-08  2:19         ` Duy Nguyen
2015-07-08  4:52           ` Junio C Hamano
2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
2015-07-06 20:10   ` René Scharfe
2015-07-06 23:02     ` Duy Nguyen
2015-07-07 14:25       ` Plamen Totev
2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-07-08 11:00       ` Duy Nguyen
2015-07-08 10:38     ` [PATCH v2 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-07-11  8:07       ` Plamen Totev
2015-07-08 10:38     ` [PATCH v2 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-09 22:55       ` Eric Sunshine
2015-07-08 11:32     ` [PATCH v2 0/9] icase " Torsten Bögershausen
2015-07-08 12:13       ` Duy Nguyen
2015-07-08 15:36     ` Junio C Hamano
2015-07-08 23:28       ` Duy Nguyen
2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-14 16:42       ` [PATCH v3 0/9] icase " Torsten Bögershausen
2015-07-15  9:39         ` Duy Nguyen
2015-07-15 19:51           ` Torsten Bögershausen
2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
