git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Duy Nguyen <pclouds@gmail.com>
To: "Torsten Bögershausen" <tboegi@web.de>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Plamen Totev" <plamen.totev@abv.bg>,
	"René Scharfe" <l.s.r@web.de>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH v2 0/9] icase match on non-ascii
Date: Wed, 8 Jul 2015 19:13:27 +0700	[thread overview]
Message-ID: <CACsJy8DPVrVBrnmq2SBDkR+TNgaDBgnZGseHwR82ap2+6yR+rw@mail.gmail.com> (raw)
In-Reply-To: <559D0A4F.9040307@web.de>

On Wed, Jul 8, 2015 at 6:32 PM, Torsten Bögershausen <tboegi@web.de> wrote:
> On 2015-07-08 12.38, Nguyễn Thái Ngọc Duy wrote:
>> Side note, I almost added the third has_non_ascii() function. Maybe we
>> should consider merging the two existing has_non_ascii() functions
>> back, or rename one to something else.
>>
> Side question:
>
> has_non_ascii can mean different things:
>  UTF-8, ISO-8859-1, ISO-8859-x...
>
> In short: everything.
> Should we be more critical here ?

For the purpose of this series, no. What we need to ask is "can we
optimize this?" and that leads to "ok we can safely optimize for
ascii-only patterns, don't optimize otherwise".

> May be this can be used from commit.c:
> static int verify_utf8(struct strbuf *buf)

Except the pcre/utf8 patch, the rest should work for all single-byte encodings..

> Other question:
> Should the "non-ascii" characters in the test scripts be octal-escaped ?

I would prefer something readable from a text editor. Even though I
don't speak Icelandic (the strings were copied from gettext test
script) I can see uppercase/lowercase of letters. But I don't know how
many people have fonts covering more than just ascii..

> Third question:
> What happens on systems, which don't have gettext, (for whatever reasons)
> --- a/gettext.c
> +++ b/gettext.c
> @@ -166,12 +166,17 @@ void git_setup_gettext(void)
> textdomain("git");
> }
>
> +int is_utf8_locale(void)
> +{
> + return !strcmp(charset, "UTF-8");
> +}

Hm.. pcre on utf-8 is screwed. I really don't want to go through
nl_langinfo, $LC_ALL, $LC_CTYPE and $LANG to detect utf8 like what's
done in compat/regex/regcomp.c.. But I guess there's no other way,
people can disable gettext and expect git to work properly with utf-8
if their pcre library supports it.

> 4th question:
> What happens on systems which don't have locale support at all ?

I suppose by locale here you do not mean "gettext" any more, then
icase does not work. We simply delegate the work to system regex/pcre.
If they don't support locale, nothing we can do. Git itself does not
know how to fold case outside ascii (except maybe utf8, I don't know
how smart our utf-8 impl is). gettext support does not matter here,
except pcre/utf8 case above.

> As one may suspect, I'm not a friend of being dependent on gettext and/or
> locale, at least not for this kind of business.
>
> Would it make more sense to have a command line option ?

I assume you only care about utf-8 here (and utf8.c knows how to fold
case), you'll need to improve compat regex to take advantage of it
first, maybe make sure it can live side by side with system regex,
also make sure pcre is compiled with utf-8 support if you use pcre.
And if setting LANG to let git know you want to use utf-8 is too
subtle, then yes a command line option makes sense :)
-- 
Duy

  reply	other threads:[~2015-07-08 12:14 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-06 11:28 Git grep does not support multi-byte characters (like UTF-8) Plamen Totev
2015-07-06 12:23 ` Duy Nguyen
2015-07-07  8:58   ` Plamen Totev
2015-07-07 12:22     ` Duy Nguyen
2015-07-07 16:07     ` Junio C Hamano
2015-07-07 18:08       ` Plamen Totev
2015-07-08  2:19         ` Duy Nguyen
2015-07-08  4:52           ` Junio C Hamano
2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
2015-07-06 20:10   ` René Scharfe
2015-07-06 23:02     ` Duy Nguyen
2015-07-07 14:25       ` Plamen Totev
2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-07-08 11:00       ` Duy Nguyen
2015-07-08 10:38     ` [PATCH v2 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-07-11  8:07       ` Plamen Totev
2015-07-08 10:38     ` [PATCH v2 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-09 22:55       ` Eric Sunshine
2015-07-08 11:32     ` [PATCH v2 0/9] icase " Torsten Bögershausen
2015-07-08 12:13       ` Duy Nguyen [this message]
2015-07-08 15:36     ` Junio C Hamano
2015-07-08 23:28       ` Duy Nguyen
2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-14 16:42       ` [PATCH v3 0/9] icase " Torsten Bögershausen
2015-07-15  9:39         ` Duy Nguyen
2015-07-15 19:51           ` Torsten Bögershausen
2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACsJy8DPVrVBrnmq2SBDkR+TNgaDBgnZGseHwR82ap2+6yR+rw@mail.gmail.com \
    --to=pclouds@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=plamen.totev@abv.bg \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).