git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jonathan Nieder <jrnieder@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH] grep: Don't pass a TODO test if REG_STARTEND is supported
Date: Fri, 16 Jul 2010 20:51:24 +0000	[thread overview]
Message-ID: <AANLkTikDwJaBN8Y0814m6JaVab9BAXPx_VKE7Z_Q6hq7@mail.gmail.com> (raw)
In-Reply-To: <20100716195007.GC16371@burratino>

On Fri, Jul 16, 2010 at 19:50, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Ævar Arnfjörð Bjarmason wrote:
>
>> Well to clarify: The TAP is arguably right, although semantically
>> these sort of tests should probably be a SKIP on unsupported
>> platforms, not a passing TODO.
>
> No, we support all platforms people are willing to fix without
> uglifying the code too much.  So a bug is a bug.  Test
> prerequisites get used for behavior that is either out of scope
> (Posix-style permissions on Windows) or hard to test (signal
> delivery to child process in t7502-commit).
>
> The semantic problem you are describing here is that we have no
> separate way to mark bugs that are not consistently reproducible.
> A “fixed” test_expect_failure is sometimes a fluke, like in this
> example.
>
> If lucky, you can find an appropriate condition and use
> test_expect_success or test_expect_failure as appropriate.  In
> the general case, that is not always easy.  Better to eliminate the
> unreproducible bugs.

The failure is totally predicated on whether or not REG_STARTEND is
available on the system, so in a perfect world (where we couldn't
provide a compat regex library) we should detect whether REG_STARTEND
is defined in regex.h, and then stick it in GIT-BUILD-OPTIONS.

Then you could do:

    test_expect_success REG_STARTEND 'git grep ile a' '
        git grep ile a
    '

But it's much easier to just fix the replacement regex library so the
test works everywhere instead of detecting for REG_STARTEND
(e.g. using a helper).

>> On Thu, Jul 15, 2010 at 18:44, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>>> We should also just upgrade the GNU regex library in compat/regex to
>>> the version that supports REG_STARTEND.
> [...]
>> This is what we should be focusing on
>
> By the way, I have no preference for choice of regex library here.  If
> something else is easier to get working correctly, that would be great.

The glibc one is probably pretty good as far as minimal POSIX DFA
engines go. Hopefully you can patch it up to get it to compile on
non-GNU systems.

More generally, the regex use in Git is something I've wanted to look
at more closely. Right now a bunch of Git tools use regexes in one
form or another, but they don't do so consistently:

    config.c:847:		  !regexec(store.value_regex, value, 0, NULL, 0)));
    config.c:1155:			if (regcomp(store.value_regex, value_regex,
    builtin/remote.c:1432:	if (regcomp(&old_regex, oldurl, REG_EXTENDED))
    builtin/remote.c:1436:		if (!regexec(&old_regex, urlset[i], 0, NULL, 0))
    builtin/blame.c:1963:	if (!(reg_error = regcomp(&regexp, spec + 1,
REG_NEWLINE)) &&
    builtin/blame.c:1964:	    !(reg_error = regexec(&regexp, line, 1,
match, 0))) {
    builtin/config.c:104:	if (use_key_regexp && regexec(key_regexp,
key_, 0, NULL, 0))
    builtin/config.c:107:	    (do_not_match ^ !!regexec(regexp,
(value_?value_:""), 0, NULL, 0)))
    builtin/config.c:177:		if (regcomp(key_regexp, key, REG_EXTENDED)) {
    builtin/config.c:190:		if (regcomp(regexp, regex_, REG_EXTENDED)) {
    builtin/apply.c:579:		if (regcomp(stamp, stamp_regexp, REG_EXTENDED)) {
    builtin/apply.c:586:	status = regexec(stamp, timestamp,
ARRAY_SIZE(m), m, 0);
    sha1_name.c:703:	if (regcomp(&regex, prefix, REG_EXTENDED))
    sha1_name.c:729:		if (!regexec(&regex, p + 2, 0, NULL, 0)) {
    diff.c:779:		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
    diff.c:2011:				if (regcomp(ecbdata.diff_words->word_regex,
    xdiff-interface.c:276:		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
    xdiff-interface.c:323:		if (regcomp(&reg->re, expression, cflags))
    diffcore-pickaxe.c:29:		while (*data && !regexec(regexp, data, 1,
&regmatch, flags)) {
    diffcore-pickaxe.c:62:		err = regcomp(&regex, needle, REG_EXTENDED
| REG_NEWLINE);
    grep.c:73:	err = regcomp(&p->regexp, p->pattern, opt->regflags);
    grep.c:375:	return regexec(preg, line, 1, match, eflags);
    http-backend.c:567:		if (regcomp(&re, c->pattern, REG_EXTENDED))
    http-backend.c:569:		if (!regexec(&re, dir, 1, out, 0)) {

Some of these are supplying REG_EXTENDED, some (like git-grep) allow
for passing REG_ICASE, some don't.

There's no way to do e.g. do a case insensitive git blame -L'/start/',
other than -L'/[sS][tT][aA][rR][tT]/' that is. It would be nice if we
could consistently pass regex flags, like -L'/start/i' in this
case. This also applies to features like the new ':/string' feature
added by Linus, maybe that should optionally be ':/string/i' (or
within the scope of the current implementation, ':!i/string' or
':!/string/i').

Regarding regular expression implementations. We might want to look
into bundling one implementation and using it everywhere, but I
haven't tested that to see if there are significant wins to be had.

There are regex implementations (notably GNU awk and GNU grep) that
are probably better fits for what Git does, a GNU grep backend for
git-log would probably search through revisions with a regex faster,
but I haven't tested it so I don't know if it's fast enough to bother
with it.

Using NFA engines like that also gives you some performance guarantees
(see [1][2]). Although that mostly matters in pathological
situations. But having Git make that guarantee might be useful,
e.g. so Gitweb can offer an interface to git-grep without having the
CPU on the host burned up by a pattern like (a*|a*)*.

The NFA engines we could use to get those features include Google's
RE2 (it's in C++, but we might fall back on e.g. the Plan9 engine),
GNU awk/grep, TRE, and probably some others.

But maybe this isn't something that's wanted or needed. Does anyone
else want a more unified regex interface in Git, or a determanistic
regex engine built-in?

1. http://swtch.com/~rsc/regexp/regexp1.html
2. http://stackoverflow.com/questions/1178173/regex-implementation-that-can-handle-machine-generated-regexs-non-backtracking

  reply	other threads:[~2010-07-16 20:51 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-08  0:42 [PATCH] grep: Don't pass a TODO test if REG_STARTEND is supported Ævar Arnfjörð Bjarmason
2010-07-08 19:40 ` Junio C Hamano
2010-07-08 20:09   ` Ævar Arnfjörð Bjarmason
2010-07-08 21:58     ` René Scharfe
2010-07-15 15:32       ` Ævar Arnfjörð Bjarmason
2010-07-15 17:47     ` Junio C Hamano
2010-07-15 18:44       ` Ævar Arnfjörð Bjarmason
     [not found]         ` <20100715220059.GA3312@burratino>
2010-07-16 13:58           ` [RFC/PATCH] Update compat/regex Ævar Arnfjörð Bjarmason
2010-07-16 14:17             ` Andreas Schwab
2010-08-15 11:08               ` Ævar Arnfjörð Bjarmason
2010-08-16 12:26                 ` Paolo Bonzini
2010-08-17  3:25                 ` [PATCH/RFC 0/3] " Ævar Arnfjörð Bjarmason
2010-08-17  3:25                 ` [PATCH/RFC 2/3] compat/regex: hacks to get the gawk regex engine to compile within git Ævar Arnfjörð Bjarmason
2010-08-17  3:35                   ` Jonathan Nieder
2010-08-17  3:25                 ` [PATCH/RFC 3/3] t/t7008-grep-binary.sh: un-TODO a test that needs REG_STARTEND Ævar Arnfjörð Bjarmason
     [not found]                 ` <1282015548-19074-2-git-send-email-avarab@gmail.com>
2010-08-17  3:37                   ` [PATCH/RFC 1/3] compat/regex: use the regex engine from gawk for compat Jonathan Nieder
2010-08-17  3:50                     ` Ævar Arnfjörð Bjarmason
2010-08-17  4:08                       ` Jonathan Nieder
2010-08-17  5:17                 ` [PATCH/RFC v2 0/3] Update compat/regex Ævar Arnfjörð Bjarmason
2010-08-17  8:03                   ` Jonathan Nieder
2010-08-17  9:24                     ` [PATCH 0/5] " Ævar Arnfjörð Bjarmason
2010-08-17 11:46                       ` Paolo Bonzini
2010-08-17 23:19                       ` Junio C Hamano
2010-08-17 23:50                         ` Jonathan Nieder
2010-08-18 10:41                           ` Ævar Arnfjörð Bjarmason
2010-08-17  9:24                     ` [PATCH 2/5] compat/regex: get the gawk regex engine to compile within git Ævar Arnfjörð Bjarmason
2010-08-17  9:24                     ` [PATCH 3/5] Change regerror() declaration from K&R style to ANSI C (C89) Ævar Arnfjörð Bjarmason
2010-08-17  9:24                     ` [PATCH 4/5] t/t7008-grep-binary.sh: un-TODO a test that needs REG_STARTEND Ævar Arnfjörð Bjarmason
2010-08-17  9:24                     ` [PATCH 5/5] autoconf: don't use platform regex if it lacks REG_STARTEND Ævar Arnfjörð Bjarmason
2010-08-17  5:17                 ` [PATCH/RFC v2 2/3] compat/regex: get the gawk regex engine to compile within git Ævar Arnfjörð Bjarmason
2010-08-17  5:17                 ` [PATCH/RFC v2 3/3] t/t7008-grep-binary.sh: un-TODO a test that needs REG_STARTEND Ævar Arnfjörð Bjarmason
2010-07-16 14:33         ` [PATCH] grep: Don't pass a TODO test if REG_STARTEND is supported Ævar Arnfjörð Bjarmason
2010-07-16 19:50           ` Jonathan Nieder
2010-07-16 20:51             ` Ævar Arnfjörð Bjarmason [this message]
2010-07-16 21:06               ` Jonathan Nieder
2010-07-16 21:19                 ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=AANLkTikDwJaBN8Y0814m6JaVab9BAXPx_VKE7Z_Q6hq7@mail.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).