git@vger.kernel.org mailing list mirror (one of many)
From: Junio C Hamano <gitster@pobox.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	Carlo Arenas <carenas@gmail.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] grep: skip UTF8 checks explicitally
Date: Fri, 26 Jul 2019 09:19:46 -0700
Message-ID: <xmqqy30kojj1.fsf@gitster-ct.c.googlers.com>
In-Reply-To: <87ef2c7roy.fsf@evledraar.gmail.com> ("Ævar Arnfjörð Bjarmason"'s message of "Fri, 26 Jul 2019 17:15:25 +0200")

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> FWIW what I meant was not that we'd run around and iconv() things, it
> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
> valid", which presumably would be the end result of something like that.
>
> Rather that this model of assuming that a UTF-8 pattern means we can
> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
> kwset patches *revealed* that problem in a painful way, but it was there
> already.

We already do assume that pathnames are UTF-8 (pathspecs on macOS
are converted and then they are matched assuming that property).
Further, with the same mechanism, I think there is an assumption
that anything that comes from the command line is UTF-8 (and if I
recall correctly, doesn't the Windows port of Git force us to use
the same assumption?  I recall we needed to tweak tests for that).

In the much longer term, I do not think we would want to keep
the assumption that the text encoding of blobs is always UTF-8, and
it would be nice to extend the system, so that blob data could be
marked in some way to say "I'm in Big-5, and not in UTF-8, so please
treat me as such" and magically the needle and the haystack can be
made to agree, by running iconv() on either one of them.

But I do not think the current topic to fix the immediate/imminent
breakage should be distracted by that.  Let's keep assuming that
any blob, when it is text, is UTF-8.

And from that point of view, I think the two ideas in your
earlier message do make sense.  We can try to match as binary most
of the time, as UTF-8 would not let a valid UTF-8 needle match in
the haystack starting in the middle of a character.  When the user
is trying to match case-insensitively, we know the haystack in which
the user is interested in finding the needle is text, even though
there may be non-text blobs as well.
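
To illustrate why matching as binary is safe for a valid UTF-8 needle,
here is a minimal sketch in Python (not Git's code; the helper names
is_char_boundary and find_all and the sample strings are made up for
illustration).  It searches bytes without decoding anything and then
checks that every hit begins on a character boundary, i.e. not on a
10xxxxxx continuation byte.

```python
def is_char_boundary(haystack: bytes, pos: int) -> bool:
    """True if pos is the end of the buffer or not a UTF-8 continuation byte."""
    return pos == len(haystack) or (haystack[pos] & 0xC0) != 0x80


def find_all(haystack: bytes, needle: bytes):
    """Yield every byte offset at which needle occurs in haystack."""
    pos = haystack.find(needle)
    while pos != -1:
        yield pos
        pos = haystack.find(needle, pos + 1)


# Hypothetical sample data; both are valid UTF-8 once encoded.
haystack = "naïve café, Ævar's façade".encode("utf-8")
needle = "é".encode("utf-8")  # the two bytes 0xC3 0xA9

# Plain byte search, no UTF-8 validation of the haystack at all.
hits = list(find_all(haystack, needle))

# A lone continuation byte is NOT a valid UTF-8 needle, and it is the
# only kind of needle that can match in the middle of a character.
stray_hits = list(find_all(haystack, b"\xa9"))

print(hits, [is_char_boundary(haystack, p) for p in hits])
print(stray_hits, [is_char_boundary(haystack, p) for p in stray_hits])
```

The valid needle only ever matches at the start of a character, while
the invalid one-byte needle lands inside one.  That property holds for
valid UTF-8 on both sides; it says nothing about case-insensitive
matching, which needs to know what the characters are.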

For example, "git grep -i 'foo' t/" may find a few png files under
the t/ directory.  We do not care if they happen to contain Foo and
we do not mind if they appear or do not appear in the result.  The
only two things we care about are (1) foo, Foo, FOO are found in the
text files under t/ and (2) the command does not die in the middle,
before processing all the files, only because a png file it found
was not valid UTF-8.
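
The two requirements can be sketched as follows, in Python rather than
in terms of Git's grep machinery or PCRE (grep_i, the blob names and
their contents are all hypothetical).  The assumption here is that a
blob that fails to decode as UTF-8 is simply matched bytewise with
ASCII-only case folding instead of aborting the whole run; whether such
a blob shows up in the result is left unspecified on purpose.

```python
import re


def grep_i(pattern: str, blobs: dict) -> list:
    """Return the names of blobs that match pattern case-insensitively.

    Text blobs (valid UTF-8) are matched as text.  Anything else falls
    back to a bytewise match, so one bad blob never stops the search.
    """
    text_re = re.compile(pattern, re.IGNORECASE)
    # For bytes patterns, re.IGNORECASE only folds ASCII letters.
    bytes_re = re.compile(pattern.encode("utf-8"), re.IGNORECASE)

    hits = []
    for name, data in blobs.items():
        try:
            found = text_re.search(data.decode("utf-8")) is not None
        except UnicodeDecodeError:
            # Not valid UTF-8 (e.g. a png): do not die, match as binary.
            found = bytes_re.search(data) is not None
        if found:
            hits.append(name)
    return hits


# Hypothetical stand-in for the files under t/.
blobs = {
    "t/a.txt": b"foo bar",
    "t/b.txt": b"say Foo",
    "t/c.txt": b"FOO!",
    "t/d.txt": b"nothing here",
    "t/img.png": b"\x89PNG\r\n\x1a\n\xff\xfe",
}

result = grep_i("foo", blobs)
print(result)
```

All three spellings in the text blobs are found, and the png blob is
visited without raising, which is all that (1) and (2) ask for.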
