git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Mathias Krause <minipli@grsecurity.net>
To: Junio C Hamano <gitster@pobox.com>, Stephane Odul <stephane@clumio.com>
Cc: git@vger.kernel.org,
	"Carlo Marcelo Arenas Belón" <carenas@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: Suspected git grep regression in git 2.40.0
Date: Tue, 21 Mar 2023 20:20:42 +0100	[thread overview]
Message-ID: <b0f4b588-9871-8e59-e5a2-3f8745a7c4cd@grsecurity.net> (raw)
In-Reply-To: <xmqqttyejc7y.fsf@gitster.g>

On 21.03.23 17:33, Junio C Hamano wrote:
> [jc: on the CC: line, summoned a few people who may know a lot more
> about pcre than I do]
> 
> Stephane Odul <stephane@clumio.com> writes:
> 
>> We have a CI pipeline on a private repository that started failing
>> consistently while running `git grep -P` commands. The command
>> fails with an exit code of -11 and is failing pretty
>> consistently. With prior versions of git there is no issue
>> whatsoever, but with 2.40.0 it always fails on the same
>> call. Other git grep calls are fine, but that one is failing
>> consistently.
>>
>> I was not able to reproduce locally but my main machine is an M1
>> MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the
>> container is based on Ubuntu 20.04 with the git client installed
>> via the PPA.
>>
>> The error is for this pattern `git grep -cP '^\w+ = json.load'`.
>>
>> As a workaround we tried to download and install the microsoft-git
>> v2.39.2 deb package since it allows us to downgrade, but then the
>> git grep commands just got stuck.
> 
> One "grep -P" related change we had recently between 2.39 and 2.40
> was
> 
>     50b6ad55 (grep: fall back to interpreter if JIT memory
>     allocation fails, 2023-01-31)
> 
> The code tries to figure out at runtime if pcre engine has
> functioning JIT by making an extra JIT compilation of a sample
> pattern and when it fails with a specific reason, fall back to
> interpreted pattern matching.  The change made the code to emit a
> bit more detailed information when it fails, but a controlled exit
> from the codepath should give $?=128, not 11.
> 
> So the above commit may or may not be related.  It could be that the
> version of pcre library linked to Git 2.40 and older Git you are
> running in your CI environment has been updated.

I don't think it's that commit. I just tried the pattern here on a system that
would require the interpreter fallback and one that wouldn't and both didn't
error out (Debian bookworm/sid; libpcre2-8: 10.42-1).

Looking at the history, though, another commit stands out:

  $ git log --oneline --no-merges v2.39.0..v2.40.0 -- grep.c
  fb2ebe72a374 grep API: plug memory leaks by freeing "header_list"
  891c9965fbc0 grep.c: refactor free_grep_patterns()
  50b6ad55b04d grep: fall back to interpreter if JIT memory allocation fails
  acabd2048ee0 grep: correctly identify utf-8 characters with \{b,w} in -P

Commit acabd2048ee0 suspiciously matches the pattern with its "\w" mention.
Looking at the diff gives:

        if (!opt->ignore_locale && is_utf8_locale() && !literal)
-               options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+               options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);


So maybe the PCRE2_UCP option isn't supported by the system in question or it
forces PCRE2 to do a verification it didn't do before?

The exit code of -11 could relate to PCRE2's error PCRE2_ERROR_UTF8_ERR9 (-11)
strengthening the suspicion it's related to that change. It's described as:

  PCRE2_ERROR_UTF8_ERR9   5th-byte's two top bits are not 0x80

Maybe the input data isn't valid UTF-8 encoded?

Stephane, can you pin point the file that the grep is failing on and verify it
is indeed a valid UFT-8 encoded file via:

  $ iconv -f UTF-8 your_file > /dev/null && echo OK || echo "Not valid"

If this prints "OK", we need to investigate further. However, if it prints
"Not valid", a broken file encoding is probably the cause and we should
revisit the above change and think of alternatives, as this might be seen as
a regression (even though the file isn't strictly valid UTF-8).

Thanks,
Mathias

> 
> Does it make a difference if you disable JIT by prefixing the
> pattern with (*NO_JIT)?
> 
> Thanks.

  reply	other threads:[~2023-03-21 19:21 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-21  8:04 Suspected git grep regression in git 2.40.0 Stephane Odul
2023-03-21 12:33 ` Bagas Sanjaya
2023-03-21 16:33 ` Junio C Hamano
2023-03-21 19:20   ` Mathias Krause [this message]
2023-03-21 20:46     ` [EXTERNAL SENDER] " Stephane Odul
2023-03-22 19:52       ` Mathias Krause
2023-03-22 20:04         ` [EXTERNAL SENDER] " Stephane Odul
2023-03-23 14:40         ` Suspected git grep regression in git 2.40.0 - proposed fix Mathias Krause
2023-03-23 16:19           ` Junio C Hamano
2023-03-23 16:36             ` Mathias Krause
2023-03-23 17:25           ` [PATCH v2] grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34 Mathias Krause

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b0f4b588-9871-8e59-e5a2-3f8745a7c4cd@grsecurity.net \
    --to=minipli@grsecurity.net \
    --cc=avarab@gmail.com \
    --cc=carenas@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=stephane@clumio.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).