git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Suspected git grep regression in git 2.40.0
@ 2023-03-21  8:04 Stephane Odul
  2023-03-21 12:33 ` Bagas Sanjaya
  2023-03-21 16:33 ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: Stephane Odul @ 2023-03-21  8:04 UTC (permalink / raw)
  To: git

We have a CI pipeline on a private repository that started failing consistently while running `git grep -P` commands. The command fails with an exit code of -11 and is failing pretty consistently. With prior versions of git there is no issue whatsoever, but with 2.40.0 it always fails on the same call. Other git grep calls are fine, but that one is failing consistently.

I was not able to reproduce locally but my main machine is an M1 MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the container is based on Ubuntu 20.04 with the git client installed via the PPA.

The error is for this pattern `git grep -cP '^\w+ = json.load'`.

As a  workaround we tried to download and install the microsoft-git v2.39.2 deb package since it allows us to downgrade, but then the git grep commands just got stuck.

Thank you,
Stephane Odul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0
  2023-03-21  8:04 Suspected git grep regression in git 2.40.0 Stephane Odul
@ 2023-03-21 12:33 ` Bagas Sanjaya
  2023-03-21 16:33 ` Junio C Hamano
  1 sibling, 0 replies; 11+ messages in thread
From: Bagas Sanjaya @ 2023-03-21 12:33 UTC (permalink / raw)
  To: Stephane Odul, git

On 3/21/23 15:04, Stephane Odul wrote:
> We have a CI pipeline on a private repository that started failing consistently while running `git grep -P` commands. The command fails with an exit code of -11 and is failing pretty consistently. With prior versions of git there is no issue whatsoever, but with 2.40.0 it always fails on the same call. Other git grep calls are fine, but that one is failing consistently.
> 
> I was not able to reproduce locally but my main machine is an M1 MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the container is based on Ubuntu 20.04 with the git client installed via the PPA.
> 
> The error is for this pattern `git grep -cP '^\w+ = json.load'`.
> 
> As a  workaround we tried to download and install the microsoft-git v2.39.2 deb package since it allows us to downgrade, but then the git grep commands just got stuck.
> 

Can you do the bisection between v2.39.0..v2.40.0?

-- 
An old man doll... just what I always wanted! - Clara


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0
  2023-03-21  8:04 Suspected git grep regression in git 2.40.0 Stephane Odul
  2023-03-21 12:33 ` Bagas Sanjaya
@ 2023-03-21 16:33 ` Junio C Hamano
  2023-03-21 19:20   ` Mathias Krause
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2023-03-21 16:33 UTC (permalink / raw)
  To: Stephane Odul
  Cc: git, Carlo Marcelo Arenas Belón, Mathias Krause,
	Ævar Arnfjörð Bjarmason

[jc: on the CC: line, summoned a few people who may know a lot more
about pcre than I do]

Stephane Odul <stephane@clumio.com> writes:

> We have a CI pipeline on a private repository that started failing
> consistently while running `git grep -P` commands. The command
> fails with an exit code of -11 and is failing pretty
> consistently. With prior versions of git there is no issue
> whatsoever, but with 2.40.0 it always fails on the same
> call. Other git grep calls are fine, but that one is failing
> consistently.
>
> I was not able to reproduce locally but my main machine is an M1
> MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the
> container is based on Ubuntu 20.04 with the git client installed
> via the PPA.
>
> The error is for this pattern `git grep -cP '^\w+ = json.load'`.
>
> As a workaround we tried to download and install the microsoft-git
> v2.39.2 deb package since it allows us to downgrade, but then the
> git grep commands just got stuck.

One "grep -P" related change we had recently between 2.39 and 2.40
was

    50b6ad55 (grep: fall back to interpreter if JIT memory
    allocation fails, 2023-01-31)

The code tries to figure out at runtime if pcre engine has
functioning JIT by making an extra JIT compilation of a sample
pattern and when it fails with a specific reason, fall back to
interpreted pattern matching.  The change made the code to emit a
bit more detailed information when it fails, but a controlled exit
from the codepath should give $?=128, not 11.

So the above commit may or may not be related.  It could be that the
version of pcre library linked to Git 2.40 and older Git you are
running in your CI environment has been updated.

Does it make a difference if you disable JIT by prefixing the
pattern with (*NO_JIT)?

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0
  2023-03-21 16:33 ` Junio C Hamano
@ 2023-03-21 19:20   ` Mathias Krause
  2023-03-21 20:46     ` [EXTERNAL SENDER] " Stephane Odul
  0 siblings, 1 reply; 11+ messages in thread
From: Mathias Krause @ 2023-03-21 19:20 UTC (permalink / raw)
  To: Junio C Hamano, Stephane Odul
  Cc: git, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason

On 21.03.23 17:33, Junio C Hamano wrote:
> [jc: on the CC: line, summoned a few people who may know a lot more
> about pcre than I do]
> 
> Stephane Odul <stephane@clumio.com> writes:
> 
>> We have a CI pipeline on a private repository that started failing
>> consistently while running `git grep -P` commands. The command
>> fails with an exit code of -11 and is failing pretty
>> consistently. With prior versions of git there is no issue
>> whatsoever, but with 2.40.0 it always fails on the same
>> call. Other git grep calls are fine, but that one is failing
>> consistently.
>>
>> I was not able to reproduce locally but my main machine is an M1
>> MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the
>> container is based on Ubuntu 20.04 with the git client installed
>> via the PPA.
>>
>> The error is for this pattern `git grep -cP '^\w+ = json.load'`.
>>
>> As a workaround we tried to download and install the microsoft-git
>> v2.39.2 deb package since it allows us to downgrade, but then the
>> git grep commands just got stuck.
> 
> One "grep -P" related change we had recently between 2.39 and 2.40
> was
> 
>     50b6ad55 (grep: fall back to interpreter if JIT memory
>     allocation fails, 2023-01-31)
> 
> The code tries to figure out at runtime if pcre engine has
> functioning JIT by making an extra JIT compilation of a sample
> pattern and when it fails with a specific reason, fall back to
> interpreted pattern matching.  The change made the code to emit a
> bit more detailed information when it fails, but a controlled exit
> from the codepath should give $?=128, not 11.
> 
> So the above commit may or may not be related.  It could be that the
> version of pcre library linked to Git 2.40 and older Git you are
> running in your CI environment has been updated.

I don't think it's that commit. I just tried the pattern here on a system that
would require the interpreter fallback and one that wouldn't and both didn't
error out (Debian bookworm/sid; libpcre2-8: 10.42-1).

Looking at the history, though, another commit stands out:

  $ git log --oneline --no-merges v2.39.0..v2.40.0 -- grep.c
  fb2ebe72a374 grep API: plug memory leaks by freeing "header_list"
  891c9965fbc0 grep.c: refactor free_grep_patterns()
  50b6ad55b04d grep: fall back to interpreter if JIT memory allocation fails
  acabd2048ee0 grep: correctly identify utf-8 characters with \{b,w} in -P

Commit acabd2048ee0 suspiciously matches the pattern with its "\w" mention.
Looking at the diff gives:

        if (!opt->ignore_locale && is_utf8_locale() && !literal)
-               options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+               options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);


So maybe the PCRE2_UCP option isn't supported by the system in question or it
forces PCRE2 to do a verification it didn't do before?

The exit code of -11 could relate to PCRE2's error PCRE2_ERROR_UTF8_ERR9 (-11)
strengthening the suspicion it's related to that change. It's described as:

  PCRE2_ERROR_UTF8_ERR9   5th-byte's two top bits are not 0x80

Maybe the input data isn't valid UTF-8 encoded?

Stephane, can you pin point the file that the grep is failing on and verify it
is indeed a valid UFT-8 encoded file via:

  $ iconv -f UTF-8 your_file > /dev/null && echo OK || echo "Not valid"

If this prints "OK", we need to investigate further. However, if it prints
"Not valid", a broken file encoding is probably the cause and we should
revisit the above change and think of alternatives, as this might be seen as
a regression (even though the file isn't strictly valid UTF-8).

Thanks,
Mathias

> 
> Does it make a difference if you disable JIT by prefixing the
> pattern with (*NO_JIT)?
> 
> Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL SENDER] Suspected git grep regression in git 2.40.0
  2023-03-21 19:20   ` Mathias Krause
@ 2023-03-21 20:46     ` Stephane Odul
  2023-03-22 19:52       ` Mathias Krause
  0 siblings, 1 reply; 11+ messages in thread
From: Stephane Odul @ 2023-03-21 20:46 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Junio C Hamano, git, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason

Thank you for looking into this so quickly.

I’m unable to reproduce locally reliably but I created a custom pipeline to reproduce more quickly.

Here are the things I found out.

* With the NO_JIT flag and limited to only python files (in our case we only want to grep on py files anyways):
  - git grep -c -P '(*NO_JIT)^[[:alnum:]_]+ = json.load' -- '*.py’
  This is snappy and works, no more error.

* Without the flag and the *.py restriction:
  - git grep -c -P '^[^ #][^#]+sys[.]argv’
    This did not fail but took almost 3m, big performance regression.
  - git grep -c -P '^[[:alnum:]_]+ = json.load’
    Crashed and returned -11. Stderr was empty so I have no idea on what file it failed.

 * With NO_JIT on all the files:
  - git grep -c -P '(*NO_JIT)^[[:alnum:]_]+ = json.load’
   This worked, that pattern is snappy but other patterns are very slow:
  - git grep -c -P '(*NO_JIT)^[^ #][^#]+sys[.]argv’
    Took 8m to complete.

 * Without the flag but only *.py.
    - git grep -c -P '^[[:alnum:]_]+ = json.load' -- '*.py’
     All the patterns run fast (under 1s), and no errors.


Note that I was trying -E and replaced \w with [[:alnum:]_] … I’ll need to revert that, but I don’t thing \w is the issue.

Overall I would say that the issue is likely because the patterns are run against a non ASCII file somewhere in the repo.
Our repo is fairly large with files in various formats, including potentially some binaries that would definitely not be proper UTF-8.

For now I have a good workaround which is to only check for *.py files, which we should have done in the first place. The NO_JIT flag slows down things significantly so we will not use it here.

Do you have any recommendation on how to identify which file(s) is causing the crash considering there is nothing in stderr?

Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0
  2023-03-21 20:46     ` [EXTERNAL SENDER] " Stephane Odul
@ 2023-03-22 19:52       ` Mathias Krause
  2023-03-22 20:04         ` [EXTERNAL SENDER] " Stephane Odul
  2023-03-23 14:40         ` Suspected git grep regression in git 2.40.0 - proposed fix Mathias Krause
  0 siblings, 2 replies; 11+ messages in thread
From: Mathias Krause @ 2023-03-22 19:52 UTC (permalink / raw)
  To: Stephane Odul, Carlo Marcelo Arenas Belón
  Cc: Junio C Hamano, git, Ævar Arnfjörð Bjarmason

On 21.03.23 21:46, Stephane Odul wrote:
> Thank you for looking into this so quickly.
> 
> I’m unable to reproduce locally reliably but I created a custom pipeline to reproduce more quickly.
> 
> Here are the things I found out.
> 
> * With the NO_JIT flag and limited to only python files (in our case we only want to grep on py files anyways):
>   - git grep -c -P '(*NO_JIT)^[[:alnum:]_]+ = json.load' -- '*.py’
>   This is snappy and works, no more error.
> 
> * Without the flag and the *.py restriction:
>   - git grep -c -P '^[^ #][^#]+sys[.]argv’
>     This did not fail but took almost 3m, big performance regression.
>   - git grep -c -P '^[[:alnum:]_]+ = json.load’
>     Crashed and returned -11. Stderr was empty so I have no idea on what file it failed.
> 
>  * With NO_JIT on all the files:
>   - git grep -c -P '(*NO_JIT)^[[:alnum:]_]+ = json.load’
>    This worked, that pattern is snappy but other patterns are very slow:
>   - git grep -c -P '(*NO_JIT)^[^ #][^#]+sys[.]argv’
>     Took 8m to complete.
> 
>  * Without the flag but only *.py.
>     - git grep -c -P '^[[:alnum:]_]+ = json.load' -- '*.py’
>      All the patterns run fast (under 1s), and no errors.
> 
> 
> Note that I was trying -E and replaced \w with [[:alnum:]_] … I’ll need to revert that, but I don’t thing \w is the issue.

Thank you for testing! But nothing of this gives a real clue. So I did
bite the bullet and set up a Ubuntu 20.04 VM, added the git PPA and
could mirror your observations of git crashing on the mentioned pattern:

  root@ubuntu-2004:~/git# git grep --threads=1 -cP '^\w+ = json.load'
  Segmentation fault (core dumped)

I changed the command slightly to run single-threaded to exclude *that*
can of worms ;) But as you can see, I could simply run the command on
the git.git tree and trigger the bug as well.

Anyhow, trying to look where git dies gave me no clue. The Ubuntu PPA,
unfortunately, lacks debug symbols, so I build my own package and ended
up with the same faulty behaviour:

  root@ubuntu-2004:~/git# gdb -q --args ~/git_pkg/git-2.40.0/git grep --threads=1 -cP '^\w+ = json.load'
  Reading symbols from /root/git_pkg/git-2.40.0/git...
  (gdb) run
  Starting program: /root/git_pkg/git-2.40.0/git grep --threads=1 -cP \^\\w+\ =\ json.load
  [Thread debugging using libthread_db enabled]

  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
  [Detaching after fork from child process 812657]

  Program received signal SIGSEGV, Segmentation fault.
  0x00007ffff7a149d3 in ?? ()
  (gdb) bt
  #0  0x00007ffff7a149d3 in ?? ()
  #1  0x0000000000000000 in ?? ()

First thing to note is that there are no symbols in the backtrace and
that it's rather short (like, where's main()?). But maybe that's just
how the JIT stack works? I don't know, maybe Carlo does?

Looking at the mappings gives (highlighting the one with the current
instruction pointer):

  (gdb) info proc mappings
  process 812654
  Mapped address spaces:

            Start Addr           End Addr       Size     Offset  Perms  objfile
        0x555555554000     0x555555572000    0x1e000        0x0  r--p   /root/git_pkg/git-2.40.0/git
        0x555555572000     0x5555558b3000   0x341000    0x1e000  r-xp   /root/git_pkg/git-2.40.0/git
        0x5555558b3000     0x5555559ae000    0xfb000   0x35f000  r--p   /root/git_pkg/git-2.40.0/git
        0x5555559af000     0x5555559b2000     0x3000   0x45a000  r--p   /root/git_pkg/git-2.40.0/git
        0x5555559b2000     0x5555559c8000    0x16000   0x45d000  rw-p   /root/git_pkg/git-2.40.0/git
        0x5555559c8000     0x555555a50000    0x88000        0x0  rw-p   [heap]
        0x7ffff7810000     0x7ffff79a0000   0x190000        0x0  rw-p

        0x7ffff7a05000     0x7ffff7a15000    0x10000        0x0  rwxp

That's where we're currently fetching instructions from. As it's a rwx
mapping (and the only one), I presume it's JIT-generated code.

        0x7ffff7a15000     0x7ffff7cfb000   0x2e6000        0x0  r--p   /usr/lib/locale/locale-archive
        0x7ffff7cfb000     0x7ffff7cfd000     0x2000        0x0  rw-p
        0x7ffff7cfd000     0x7ffff7d1f000    0x22000        0x0  r--p   /usr/lib/x86_64-linux-gnu/libc-2.31.so
        0x7ffff7d1f000     0x7ffff7e97000   0x178000    0x22000  r-xp   /usr/lib/x86_64-linux-gnu/libc-2.31.so
        0x7ffff7e97000     0x7ffff7ee5000    0x4e000   0x19a000  r--p   /usr/lib/x86_64-linux-gnu/libc-2.31.so
        0x7ffff7ee5000     0x7ffff7ee9000     0x4000   0x1e7000  r--p   /usr/lib/x86_64-linux-gnu/libc-2.31.so
        0x7ffff7ee9000     0x7ffff7eeb000     0x2000   0x1eb000  rw-p   /usr/lib/x86_64-linux-gnu/libc-2.31.so
        0x7ffff7eeb000     0x7ffff7eef000     0x4000        0x0  rw-p
        0x7ffff7eef000     0x7ffff7ef5000     0x6000        0x0  r--p   /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
        0x7ffff7ef5000     0x7ffff7f06000    0x11000     0x6000  r-xp   /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
        0x7ffff7f06000     0x7ffff7f0c000     0x6000    0x17000  r--p   /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
        0x7ffff7f0c000     0x7ffff7f0d000     0x1000    0x1c000  r--p   /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
        0x7ffff7f0d000     0x7ffff7f0e000     0x1000    0x1d000  rw-p   /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
        0x7ffff7f0e000     0x7ffff7f12000     0x4000        0x0  rw-p
        0x7ffff7f12000     0x7ffff7f14000     0x2000        0x0  r--p   /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
        0x7ffff7f14000     0x7ffff7f25000    0x11000     0x2000  r-xp   /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
        0x7ffff7f25000     0x7ffff7f2b000     0x6000    0x13000  r--p   /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
        0x7ffff7f2b000     0x7ffff7f2c000     0x1000    0x19000  ---p   /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
        0x7ffff7f2c000     0x7ffff7f2d000     0x1000    0x19000  r--p   /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
        0x7ffff7f2d000     0x7ffff7f2e000     0x1000    0x1a000  rw-p   /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
        0x7ffff7f2e000     0x7ffff7f30000     0x2000        0x0  r--p   /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.9.0
        0x7ffff7f30000     0x7ffff7f95000    0x65000     0x2000  r-xp   /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.9.0
        0x7ffff7f95000     0x7ffff7fbd000    0x28000    0x67000  r--p   /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.9.0
        0x7ffff7fbd000     0x7ffff7fbe000     0x1000    0x8e000  r--p   /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.9.0
        0x7ffff7fbe000     0x7ffff7fbf000     0x1000    0x8f000  rw-p   /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.9.0
        0x7ffff7fbf000     0x7ffff7fc1000     0x2000        0x0  rw-p
        0x7ffff7fcb000     0x7ffff7fce000     0x3000        0x0  r--p   [vvar]
        0x7ffff7fce000     0x7ffff7fcf000     0x1000        0x0  r-xp   [vdso]
        0x7ffff7fcf000     0x7ffff7fd0000     0x1000        0x0  r--p   /usr/lib/x86_64-linux-gnu/ld-2.31.so
        0x7ffff7fd0000     0x7ffff7ff3000    0x23000     0x1000  r-xp   /usr/lib/x86_64-linux-gnu/ld-2.31.so
        0x7ffff7ff3000     0x7ffff7ffb000     0x8000    0x24000  r--p   /usr/lib/x86_64-linux-gnu/ld-2.31.so
        0x7ffff7ffc000     0x7ffff7ffd000     0x1000    0x2c000  r--p   /usr/lib/x86_64-linux-gnu/ld-2.31.so
        0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x2d000  rw-p   /usr/lib/x86_64-linux-gnu/ld-2.31.so
        0x7ffff7ffe000     0x7ffff7fff000     0x1000        0x0  rw-p
        0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]
    0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]

The faulting instruction is:

  (gdb) x/i $pc
  => 0x7ffff7a149d3:	movzwq (%rcx,%r9,1),%rcx
  (gdb) info registers rcx r9
  rcx            0x3fffffffffffffe   288230376151711742
  r9             0x7ffff7fae680      140737353803392

It tries to read from the non-canonical address 0x04007ffff7fae67e.
Both, the base value of the instruction and the index look way off,
which makes me suspect, what we see here is a JIT compiler bug. :/

Carlo, does that ring a bell? The libpcre2 version I'm using here is
10.34-7ubuntu0.1 (default version in Ubuntu 20.04).

Further testing also let me confirm my prior suspicion that it is,
indeed, commit acabd2048ee0 ("grep: correctly identify utf-8 characters
with \{b,w} in -P") that is causing the faulty behaviour. Reverting
that commit and rebuilding the package leads to a working git binary.

Carlo or any other more familiar with the PCRE2 library, does that ring
a bell? Did PCRE2 had a bug related to JIT and PCRE2_UCP?

> 
> Overall I would say that the issue is likely because the patterns are run against a non ASCII file somewhere in the repo.
> Our repo is fairly large with files in various formats, including potentially some binaries that would definitely not be proper UTF-8.
> 
> For now I have a good workaround which is to only check for *.py files, which we should have done in the first place. The NO_JIT flag slows down things significantly so we will not use it here.
> 
> Do you have any recommendation on how to identify which file(s) is causing the crash considering there is nothing in stderr?

strace ;) But as I could reproduce the issue too, it's no longer needed.

Thanks,
Mathias

> 
> Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [EXTERNAL SENDER] Suspected git grep regression in git 2.40.0
  2023-03-22 19:52       ` Mathias Krause
@ 2023-03-22 20:04         ` Stephane Odul
  2023-03-23 14:40         ` Suspected git grep regression in git 2.40.0 - proposed fix Mathias Krause
  1 sibling, 0 replies; 11+ messages in thread
From: Stephane Odul @ 2023-03-22 20:04 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Carlo Marcelo Arenas Belón, Junio C Hamano, git,
	Ævar Arnfjörð Bjarmason

This is great work Mathias!

Now that you have an easy way to reproduce, and I have a good workaround, I do not believe there is much more I can contribute and will leave this in your expert hands.

Thank you,
Stephane Odul

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0 - proposed fix
  2023-03-22 19:52       ` Mathias Krause
  2023-03-22 20:04         ` [EXTERNAL SENDER] " Stephane Odul
@ 2023-03-23 14:40         ` Mathias Krause
  2023-03-23 16:19           ` Junio C Hamano
  2023-03-23 17:25           ` [PATCH v2] grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34 Mathias Krause
  1 sibling, 2 replies; 11+ messages in thread
From: Mathias Krause @ 2023-03-23 14:40 UTC (permalink / raw)
  To: git, Junio C Hamano
  Cc: Mathias Krause, Stephane Odul, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason

I looked a little further and found an interesting entry in the PCRE2's
changelog for version 10.35:

https://github.com/PCRE2Project/pcre2/blob/pcre2-10.35/ChangeLog#L66:

  17. Fix a crash which occurs when the character type of an invalid UTF
  character is decoded in JIT.

Its fix commit https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
and associated unit test is basically doing the same as what Stephane is
running into: making the JIT compiled code choke on something that's not
a valid UTF-8 string.

So it looks like we're out of luck and have to implement yet another
quirk to handle these broken versions. We can either disable the JIT
compiler completely for these versions (and exchange the segfault for a
serve performance regression) or fall-back to the previous behaviour and
ignore Unicode properties (and reintroduce the bug commit acabd2048ee0
("grep: correctly identify utf-8 characters with \{b,w} in -P") wanted
to fix).

I went with the second option and could confirm the below patch fixes
the segfault on Ubuntu 20.04 which is shipping the broken version.

Junio, what's your call on it? Below patch or simply a revert of commit
acabd2048ee0? Other ideas?

Thanks,
Mathias

-- >8 --
Subject: [PATCH] grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34

Stephane is reporting[1] a regression introduced in git v2.40.0 that leads
to 'git grep' segfaulting in his CI pipeline. It turns out, he's using an
older version of libpcre2 that triggers a wild pointer dereference in
the generated JIT code that was fixed in PCRE2 10.35.

Instead of completely disabling the JIT compiler for the buggy version,
just mask out the Unicode property handling as we used to do prior to
commit acabd2048ee0 ("grep: correctly identify utf-8 characters with
\{b,w} in -P").

[1] https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@clumio.com/

Reported-by: Stephane Odul <stephane@clumio.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
 grep.c | 11 ++++++++++-
 grep.h |  3 +++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index cee44a78d044..d249beae60d0 100644
--- a/grep.c
+++ b/grep.c
@@ -317,8 +317,17 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && !literal)
+	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
 		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
+#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
+		/*
+		 * Work around a JIT bug related to invalid Unicode character
+		 * handling fixed in 10.35:
+		 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
+		 */
+		options ^= PCRE2_UCP;
+#endif
+	}
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/grep.h b/grep.h
index 6075f997e68f..c59592e3bdba 100644
--- a/grep.h
+++ b/grep.h
@@ -7,6 +7,9 @@
 #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11
 #define GIT_PCRE2_VERSION_10_36_OR_HIGHER
 #endif
+#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 35) || PCRE2_MAJOR >= 11
+#define GIT_PCRE2_VERSION_10_35_OR_HIGHER
+#endif
 #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 34) || PCRE2_MAJOR >= 11
 #define GIT_PCRE2_VERSION_10_34_OR_HIGHER
 #endif
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0 - proposed fix
  2023-03-23 14:40         ` Suspected git grep regression in git 2.40.0 - proposed fix Mathias Krause
@ 2023-03-23 16:19           ` Junio C Hamano
  2023-03-23 16:36             ` Mathias Krause
  2023-03-23 17:25           ` [PATCH v2] grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34 Mathias Krause
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2023-03-23 16:19 UTC (permalink / raw)
  To: Mathias Krause
  Cc: git, Stephane Odul, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason

Mathias Krause <minipli@grsecurity.net> writes:

> ... or fall-back to the previous behaviour and
> ignore Unicode properties (and reintroduce the bug commit acabd2048ee0
> ("grep: correctly identify utf-8 characters with \{b,w} in -P") wanted
> to fix).
>
> I went with the second option and could confirm the below patch fixes
> the segfault on Ubuntu 20.04 which is shipping the broken version.
>
> Junio, what's your call on it? Below patch or simply a revert of commit
> acabd2048ee0? Other ideas?

Thanks for all the investigation and a prompt fix.  Very much
appreciated.

As you identified where the breakage ended in the versions of pcre,
I think that the approach the patch I am responding to takes is more
sensible than a straight revert.

But I somehow suspect that it may be better to have the "workaround"
independent of the line acabd204 (grep: correctly identify utf-8
characters with \{b,w} in -P, 2023-01-08) touched, more like how we
"Work around ... fixed in 10.36".

IOW, not like this

>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> +	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>  		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
> +#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> +		/*
> +		 * Work around a JIT bug related to invalid Unicode character
> +		 * handling fixed in 10.35:
> +		 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> +		 */
> +		options ^= PCRE2_UCP;
> +#endif
> +	}

but more like

 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
 		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
+#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
+	/*
+	 * Work around a JIT bug related to invalid Unicode character
+	 * handling fixed in 10.35:
+	 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
+	 */
+	options ^= PCRE2_UCP;
+#endif
+
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
 	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))

That way, no matter how we ended up setting the UCP bit in the
options variable, we would avoid broken UCP handling on problematic
versions, no?

>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> diff --git a/grep.h b/grep.h
> index 6075f997e68f..c59592e3bdba 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -7,6 +7,9 @@
>  #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11
>  #define GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  #endif
> +#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 35) || PCRE2_MAJOR >= 11
> +#define GIT_PCRE2_VERSION_10_35_OR_HIGHER
> +#endif
>  #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 34) || PCRE2_MAJOR >= 11
>  #define GIT_PCRE2_VERSION_10_34_OR_HIGHER
>  #endif

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Suspected git grep regression in git 2.40.0 - proposed fix
  2023-03-23 16:19           ` Junio C Hamano
@ 2023-03-23 16:36             ` Mathias Krause
  0 siblings, 0 replies; 11+ messages in thread
From: Mathias Krause @ 2023-03-23 16:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stephane Odul, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason

On 23.03.23 17:19, Junio C Hamano wrote:
> Mathias Krause <minipli@grsecurity.net> writes:
> 
>> ... or fall-back to the previous behaviour and
>> ignore Unicode properties (and reintroduce the bug commit acabd2048ee0
>> ("grep: correctly identify utf-8 characters with \{b,w} in -P") wanted
>> to fix).
>>
>> I went with the second option and could confirm the below patch fixes
>> the segfault on Ubuntu 20.04 which is shipping the broken version.
>>
>> Junio, what's your call on it? Below patch or simply a revert of commit
>> acabd2048ee0? Other ideas?
> 
> Thanks for all the investigation and a prompt fix.  Very much
> appreciated.

As there was the slight chance it could be caused by my grep.c change, I
had the ambition to prove it either way. :D

> As you identified where the breakage ended in the versions of pcre,
> I think that the approach the patch I am responding to takes is more
> sensible than a straight revert.

Ok. It's really unfortunate that PCRE2 has so many bugs we need to work
around :/

> But I somehow suspect that it may be better to have the "workaround"
> independent of the line acabd204 (grep: correctly identify utf-8
> characters with \{b,w} in -P, 2023-01-08) touched, more like how we
> "Work around ... fixed in 10.36".
> 
> IOW, not like this
> 
>>  		options |= PCRE2_CASELESS;
>>  	}
>> -	if (!opt->ignore_locale && is_utf8_locale() && !literal)
>> +	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>>  		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>> +#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>> +		/*
>> +		 * Work around a JIT bug related to invalid Unicode character
>> +		 * handling fixed in 10.35:
>> +		 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>> +		 */
>> +		options ^= PCRE2_UCP;
>> +#endif
>> +	}
> 
> but more like
> 
>  	if (!opt->ignore_locale && is_utf8_locale() && !literal)
>  		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>  
> +#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> +	/*
> +	 * Work around a JIT bug related to invalid Unicode character
> +	 * handling fixed in 10.35:
> +	 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> +	 */
> +	options ^= PCRE2_UCP;
> +#endif
> +
>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
>  	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))
> 
> That way, no matter how we ended up setting the UCP bit in the
> options variable, we would avoid broken UCP handling on problematic
> versions, no?

Ah, sure! It's no issue with the current code base, as PCRE2_UCP only
gets set in that branch, but there's no need to narrow the workaround to
it as well.

I'll send a v2 in a few!

Thanks,
Mathias

> 
>>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
>> diff --git a/grep.h b/grep.h
>> index 6075f997e68f..c59592e3bdba 100644
>> --- a/grep.h
>> +++ b/grep.h
>> @@ -7,6 +7,9 @@
>>  #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11
>>  #define GIT_PCRE2_VERSION_10_36_OR_HIGHER
>>  #endif
>> +#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 35) || PCRE2_MAJOR >= 11
>> +#define GIT_PCRE2_VERSION_10_35_OR_HIGHER
>> +#endif
>>  #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 34) || PCRE2_MAJOR >= 11
>>  #define GIT_PCRE2_VERSION_10_34_OR_HIGHER
>>  #endif

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2] grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34
  2023-03-23 14:40         ` Suspected git grep regression in git 2.40.0 - proposed fix Mathias Krause
  2023-03-23 16:19           ` Junio C Hamano
@ 2023-03-23 17:25           ` Mathias Krause
  1 sibling, 0 replies; 11+ messages in thread
From: Mathias Krause @ 2023-03-23 17:25 UTC (permalink / raw)
  To: git, Junio C Hamano
  Cc: Mathias Krause, Stephane Odul, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason

Stephane is reporting[1] a regression introduced in git v2.40.0 that leads
to 'git grep' segfaulting in his CI pipeline. It turns out, he's using an
older version of libpcre2 that triggers a wild pointer dereference in
the generated JIT code that was fixed in PCRE2 10.35.

Instead of completely disabling the JIT compiler for the buggy version,
just mask out the Unicode property handling as we used to do prior to
commit acabd2048ee0 ("grep: correctly identify utf-8 characters with
\{b,w} in -P").

[1] https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@clumio.com/

Reported-by: Stephane Odul <stephane@clumio.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
---
v2:
- make PCRE2_UCP masking depend only on the PCRE2 version, as
  suggested by Junio

 grep.c | 9 +++++++++
 grep.h | 3 +++
 2 files changed, 12 insertions(+)

diff --git a/grep.c b/grep.c
index cee44a78d044..dcfa7a27bf88 100644
--- a/grep.c
+++ b/grep.c
@@ -320,6 +320,15 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
 		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
+#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
+	/*
+	 * Work around a JIT bug related to invalid Unicode character handling
+	 * fixed in 10.35:
+	 * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
+	 */
+	options &= ~PCRE2_UCP;
+#endif
+
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
 	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))
diff --git a/grep.h b/grep.h
index 6075f997e68f..c59592e3bdba 100644
--- a/grep.h
+++ b/grep.h
@@ -7,6 +7,9 @@
 #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 36) || PCRE2_MAJOR >= 11
 #define GIT_PCRE2_VERSION_10_36_OR_HIGHER
 #endif
+#if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 35) || PCRE2_MAJOR >= 11
+#define GIT_PCRE2_VERSION_10_35_OR_HIGHER
+#endif
 #if (PCRE2_MAJOR >= 10 && PCRE2_MINOR >= 34) || PCRE2_MAJOR >= 11
 #define GIT_PCRE2_VERSION_10_34_OR_HIGHER
 #endif
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-03-23 17:28 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-21  8:04 Suspected git grep regression in git 2.40.0 Stephane Odul
2023-03-21 12:33 ` Bagas Sanjaya
2023-03-21 16:33 ` Junio C Hamano
2023-03-21 19:20   ` Mathias Krause
2023-03-21 20:46     ` [EXTERNAL SENDER] " Stephane Odul
2023-03-22 19:52       ` Mathias Krause
2023-03-22 20:04         ` [EXTERNAL SENDER] " Stephane Odul
2023-03-23 14:40         ` Suspected git grep regression in git 2.40.0 - proposed fix Mathias Krause
2023-03-23 16:19           ` Junio C Hamano
2023-03-23 16:36             ` Mathias Krause
2023-03-23 17:25           ` [PATCH v2] grep: work around UTF-8 related JIT bug in PCRE2 <= 10.34 Mathias Krause

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).